ABSTRACT

This work describes a large number of constructions for sorting N numbers
in the range [0, M] for the standard VLSI bit model. Among other results,
we attain:

•  VLSI  sorter  constructions  that  are  within  a  constant  factor  of  optimal  size 
for  almost  all  number  ranges  M  (including  M  =  N),  and  running  times  T. 

•  A  fundamentally  new  merging  network  for  sorting  numbers  in  a  bit 
model. 

•  New  organizational  approaches  for  optimal  tuning  of  merging  networks 
and  the  proper  management  of  data  flow. 

1.   Introduction 

We describe VLSI circuits that sort N numbers in the range 0 to M - 1; the sorters have
area A and running time O(T)†. We have two sorting models: perimeter sorters, which have
their  I/O  ports  on  the  boundary  of  the  circuit,  and  dense  sorters,  which  can  have  I/O  ports 
anywhere  in  the  circuit. 

The results divide into three classes, according as $M < N$, $N \le M \le N^2$, and $N^2 \le M$. (The
partition at $N^2$ is somewhat arbitrary; it could be at any fixed power of N exceeding 1.) We
give sorting circuits for each of these classes, for both the perimeter and the dense models.
The bounds are given in the tables below; one column in the tables gives the ratio of our
upper bound to the best known lower bound. Our upper bounds match the lower bounds in
almost all instances (up to a constant factor). Most of our bounds are $AT^2$ and $AT$ tradeoffs.
However, for very large values of M other tradeoffs also arise. Interestingly enough, for
these values of M, the $AT^2$ and $AT$ tradeoffs apply to two related problems: ranking, and
sorting in the case that the inputs can be read twice.

Among other results, we resolve (in the affirmative) the question of tightness for
Thompson's original $AT^2 = \Omega(N^2)$ lower bound [Thompson, 1979], which applies to essentially
planar circuits that sort N numbers in the range [0, N-1]. Specifically, we show that an
$AT^2 = \Theta(N^2)$ tradeoff holds for $T \in [\log N \log^* N, N^{1/2}]$. For $T = N^{1/2}$, the area is just
$\Theta(N)$, the minimum possible area for sorting N numbers in the range 0 to N-1 [Ullman, 1984]. In
fact, our circuits achieve, for every choice of M and (suitably) maximal T, the minimum area
possible for VLSI sorters. Furthermore, these small circuits are all optimal in speed.

The theory of $AT^2$ bounds for VLSI originates in [Thompson, 1979], where the sorting
bound $AT^2 = \Omega(N^2)$, mentioned above, is established. [Thompson, 1980] shows that
$AT^2 = \Omega(N^2 \log^2 N)$ for a (word-local) restricted class of circuits that sort N numbers in the
range $[0, N^2]$, and [Thompson, 1981] demonstrates a circuit construction that satisfies this
bound for $T = O(N^{1/2})$. In [Leighton, 1984], Thompson's lower bound (for $M = N^{\alpha}$) is shown
to hold without Thompson's restrictions. Moreover, a construction is given that meets this
$N^2 \log^2 N$ bound for $T \in [\log N, (N \log N)^{1/2}]$. [Bilardi and Preparata, 1984] give an independent
(and different) construction that satisfies the same bounds. The Leighton and
Bilardi-Preparata sorting circuits apply for any fixed range of size $M = N^{\alpha}$, where $\alpha$ is fixed and
greater than 1.

Minimum area bounds for sorters appear for special cases in [Ullman, 1984] and
[Leighton, 1984]. The complete solution appears in [Siegel, 1984a] and was independently
discovered by [Duris, Sykora, Thompson, and Vrt'o, 1984]. In his thesis, [Bilardi, 1984]
shows that another method suffices to give the same results. The $AT^2$ complexity for sorters,
in the case $N \le M \le N^2$, originates in [Siegel, 1984b]; a simpler proof of this result is given in
[Siegel, 1985]. In his thesis, [Bilardi, 1984] shows that Leighton's method (given in
[Leighton, 1984] for $M = N^{\alpha}$) can be extended to give the same results. The $AT^2$ bounds can
be adapted to apply to other number zones. For $M \gg N^2$, the results were independently
discovered by [Siegel, 1984c] and [Bilardi and Preparata, 1985b]. For $M < N$, the bounds
appear in [Siegel 1984a, as revised for publication, 1985], and are implicit in an independent
result of [El Gammal and Pang, 1984].

† Note that by using a running time proportional to T, we can describe the limits for T without
concern for the fact that the circuit may be a constant factor slower.

In independent work, parallel with the results reported in this paper, [Bilardi, 1984] and
[Bilardi and Preparata, 1985a] describe several sorting circuits. In particular, they extend the
range of optimal sorting circuits (for both dense and perimeter sorters) to include
$M \in [N \log^{(i)} N, N^2]$, for any fixed i, where $\log^{(i)} N$ is the ith iterate of the logarithm; they give
non-optimal dense sorter constructions for $M \le N$ (their results are optimal for constant M).
Furthermore, in [Bilardi and Preparata, 1985a] an optimal construction for dense sorters, for
$M \ge N^2$, is given, satisfying an $AT/\log A$ tradeoff; a matching lower bound is given in [Bilardi
and Preparata, 1985c]. Recently $AT$ lower bounds for dense sorters were shown in [Bilardi
and Preparata, 1985b]. Our constructions show that the $AT$ lower bounds are tight. Also,
we are able to extend the $AT$ lower bounds to give $AT$ bounds for perimeter sorters, thereby
showing our $AT$ upper bounds for perimeter sorters are tight, in most instances.

The first step in our sorting constructions is standard. Both [Leighton, 1984] and
[Bilardi and Preparata, 1984] exploit the fact that to sort N numbers, when, say, $M = N^2$, it is
useful to use many smaller sorters and then to somehow merge the sorted subsets. The crux
of our constructions is how to perform the merging. Leighton achieves the merging by a
clever generalization of odd-even merge; as a consequence, he proves that the $AT^2$ complexity
of sorting is not in the sorting subcircuits, but rather in the hardwired permutation networks
used to implement columnsort. In contrast, it is provably impossible to build an optimal sorter
that utilizes such a permutation network for any number range other than where
$AT^2 = \Theta(N^2 \log^2 N)$; the data flow would be much too large. Thus our constructions are based
on a variety of different approaches.

An essential step is the recognition that we need to use data compression (encoding) to
achieve optimal constructions, when $M \ll N^2$. We are not the first to consider encodings in
this context. [Loui, 1983] used encoding to obtain results about the number of bits that must
be communicated when sorting (word-local input data) on a ring and on a mesh. His
encoding was optimal only for some values of M, which for our purposes is inadequate. In
[Siegel, 1984a] an optimal encoding scheme is given (that is optimal to second order terms, in
fact). We employ this scheme. Part of the scheme is implicitly presented in [Carter et al,
1978]. In [Bilardi, 1984] and in [Duris et al, 1984] different optimal schemes are proposed,
which are optimal to within constant factors. Thus the encoding methods, though important,
are not new (and quite possibly predate modern computer science). The key point of the
encodings is that a set of X numbers in the range [0, R], $X^2 \ge R \ge X$, can be encoded using
$O(X \log(2R/X))$ bits. For $X \ge R$, the encoding requires $O(R \log(2X/R))$ bits. Furthermore, we can
merge a pair of encoded sets systolically in unit time per bit. More details are given in
appendix A.
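
To make the $O(X \log(2R/X))$ bound concrete, the following is a minimal sketch of one well-known way to reach it in the case $R \ge X$: Rice coding of the gaps between consecutive sorted values. This is an illustrative stand-in, not the scheme of appendix A; the function names `rice_encode` and `rice_decode` are hypothetical. With parameter k about $\log_2(R/X)$, the total length is below $X(3 + \log_2(R/X))$ bits.

```python
import math

def rice_encode(sorted_values, range_size):
    """Encode sorted integers in [0, range_size) as Rice-coded gaps (list of bits)."""
    x = len(sorted_values)
    # Rice parameter k = floor(log2(R // X)), clamped at zero (assumption: R >= X).
    k = max(0, (range_size // max(x, 1)).bit_length() - 1)
    bits, previous = [], 0
    for value in sorted_values:
        gap, previous = value - previous, value
        bits.extend([1] * (gap >> k) + [0])                      # unary quotient
        bits.extend((gap >> i) & 1 for i in reversed(range(k)))  # k-bit remainder
    return k, bits

def rice_decode(k, bits, count):
    """Invert rice_encode, returning the original sorted list."""
    values, previous, pos = [], 0, 0
    for _ in range(count):
        quotient = 0
        while bits[pos] == 1:
            quotient += 1
            pos += 1
        pos += 1                                   # skip the terminating 0
        remainder = 0
        for _ in range(k):
            remainder = (remainder << 1) | bits[pos]
            pos += 1
        previous += (quotient << k) | remainder
        values.append(previous)
    return values
```

The length bound follows because the unary parts total at most $R/2^k < 2X$ bits, while the terminators and remainders contribute $X(1 + k)$ bits.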

Many  of  our  results  rely  on  a  fundamentally  new  sorting  construction  that  we  call  an 
aligned  merging  network,  described  in  section  2.  All  of  our  sorting  results  depend  on 
carefully  tuned  organizational  techniques  based  on  the  data  flow  of  the  sorting  circuits.  Some 
constructions  exploit  an  unusual  form  of  pipelining:  the  running  times  of  the  pipeline  units 
are  not  uniform.  Our  use  of  this  technique  is  not  straightforward;  it  requires  some  care  to 
ensure  the  optimal  bounds  are  achieved.  This  technique  is  used,  for  example,  to  achieve  the 
optimal bound for M = N, when $T = N^{1/2}$. The sorter has area $\Theta(N)$, which is sufficient to
store the encoded input data; the problem is that the unencoded input consists of $N \log N$ bits.
The difficulty is that the input has to be sorted before it can be compressed into a
representation using just $\Theta(N)$ bits. Our solution inputs just a portion of the input values; the
data is then sorted and compressed, and the process is iterated. However, after some
number of iterations, we are obliged to apply further compression to the data that has
already been input. We have several recompressions; each successive recompression is
carried out less frequently, but takes longer to perform. The hardest part of the construction
lies in ensuring that the sum of the areas used for performing the various recompressions is
only $O(N)$; this requires careful balancing of running times, areas used, and the frequency of
data  recompression.  This  methodology  is  similar  to  the  funneled  pipelining  described  by 
[Hochschild,  Mayr,  Siegel,  1983],  and  developed  further  in  [Hochschild,  1985].  In  those 
works,  however,  the  technique  was  applied  to  graph  problems,  and  had  a  somewhat  different 
flavor,  since  the  purpose  of  the  technique  was  to  discard  data  outright,  and  extra  log  factors 
in  time  and  area  were  of  no  concern.  When  aiming  for  optimal  constructions,  these  factors 
are  important,  and  for  parallel  sorting,  the  entire  problem  is  a  matter  of  removing  the  log 
factors. 

It  is  worth  noting  that  our  constructions  typically  use  hybrid  merging  networks.  The 
reason  our  sorters  frequently  require  several  different  merging  organizations  is  best 
understood  by  examining  the  lower  bound  proofs.  Typically,  a  proof  follows  from 
demonstrating  that  a  sorting  circuit  has  an  information  bottleneck.  That  is,  there  is  a  region 
of  the  circuit  that  has  just  enough  perimeter  to  accommodate  the  information  flow  that  must, 
in the worst case, cross the boundary during a specific period of time. When $M < N$, the $AT^2$
bottleneck for dense sorters occurs for a (minimal) region that inputs $M \log M$ bits (with
information content $\Omega(M)$). For $N \le M \le N^2$, the bottleneck occurs for a region that inputs
$\frac{N}{[\ldots]} \log M$ bits (with information content $\Omega(N \log(2M/N))$), while for $M > N^2$, it occurs for a
region that inputs $N \log N$ bits (with information content $\Omega(N \log N)$). Our circuits are
constrained  by  these  bottlenecks;  in  fact,  our  constructions  typically  have  three  parts.  The 
first  uses  a  merging  network  that,  in  general,  sorts  a  set  of  inputs  with  a  size  just  below  the 
lower  bound  bottleneck.  The  second  is  a  network  that  performs  a  merge  of  data  at  the 
bottleneck,  and  the  third  is  a  network  that  completes  the  task.  Of  the  three,  the  third  is  by  far 
the  easiest  circuit  to  design,  and  typically,  the  second  is  the  most  delicate.  (Sometimes  the 
second  part  divides  into  two  subparts:  one,  performing  a  merge  of  data  so  that  the  merged  set 
has  a  size  equal  to  the  bottleneck;  the  other,  performing  a  merge  of  sets,  each  of  size  equal  to 
the  bottleneck.) 

In  the  following  sections  we  describe  the  various  sorting  networks.  All  constructions 
use, as subunits, optimal sorters of r numbers in $[0, r^2]$, as constructed in [Leighton, 1984];
we  call  such  devices  circuit  switching  sorters,  because  they  can  form  routing  paths  that 
connect  each  input  port  with  the  appropriate  sort-determined  output  port.  Alternatively,  we 
could,  in  all  cases  but  one,  use  the  sorters  described  in  [Bilardi-Preparata,  1984]  (which  do 
not  compress  data  either).  We  will  also  need  several  merging  networks,  and  related 
constructs.  Their  specific  designs  require  careful  detail,  but  are  not  technically  difficult;  as  a 
consequence,  their  description  is  left  to  appendix  C.  Section  2  introduces  aligned  merging 
networks by describing how to build an optimal perimeter sorter for M = N and $T = \log N$.
Section 3 describes the construction of an optimal dense sorting circuit for M = N and
$T = \log N \log^* N$, and shows how the information flow can be used to organize what is clearly a
complicated,  delicately  tuned  cascade  of  merging  networks. 

Our other sorting circuits, for $M \le N^2$, are built using elaborations of the techniques
presented in sections 2 and 3, and are described in sections 4, 5 and 6. The sorters for
$M \ge N^2$ are by far the simplest to design; they are described in sections 7 and 8. They rely not
on data compression, but rather on the distributed sorting of pieces of each number. Were all
bits of each output variable to emerge from a local region of circuitry, the requisite data flow
would be much too large. This idea of distributing pieces of each number among many
cooperating processors is due to [Leighton 1984]. Our contribution in this case is to show
how to tune the circuits for optimal results, and how to extend these ideas, when necessary,
via new organizations of circuit cascades.

Remark:  It  should  be  noted  that  our  constructions  will  show  how  to  obtain  an  encoded  form 
of  the  inputs,  which  is  implicitly  a  sorted  order.  By  essentially  reversing  the  methods  we 
describe  (but  with  much  less  effort),  one  can  obtain,  as  outputs,  the  inputs  in  sorted  order. 
Also,  it  should  be  noted  that  we  construct  bounded  degree  circuits  that  have  when-  and 
where-determinate input/output schedules, as defined in [Ullman, 1984].

Dense Sorters: $AT^2$ Upper Bounds.

size | upper bound | ratio: upper bound / lower bound | comments
--- | --- | --- | ---
$f \le T \le (N \log N)^{1/2}$, where $f = \log\log M + [\ldots]$ | $N^2 \log N \log M$ | 1 |
$N \log N \le M \le N^2$ | $N^2 \log^2 [\ldots]$ | 1 | 1
$N \le M \le N \log N$; $\log N\, \beta(N, M/N) \le T \le (N \log [\ldots])^{1/2}$ | $N [\ldots]$ | 1 | 2, 3
$M = N$; $\log N \le T \le \log N \log^* N$ | [...] | [...] (here $T = i \log N$) | 2, 3, 4
$\max(\log N, \log M \log^* M) \le T \le M^{1/2}$ | $MN$ | 1 | 2, 3

Remark: We achieve tight $AT^2$ bounds for all relevant M, and for most of the relevant time
interval. Note that A, T, and ratios of upper to lower bounds should all be understood as
given up to a constant factor; thus, 1 should be read as $O(1)$.

Comment 1: The result for $M = N^{1+\alpha}$, $\alpha > 0$, is due to [Bilardi and Preparata 1984] and
[Leighton 1984].

Comment 2: $\beta(x, y)$ = least i such that $\log^{(i)} x \le y$; $\log^* x = \beta(x, 2)$.

Comment 3: For $M < N$ we can write $AT^2$ bounds for $T \in [\log N, \log M \log^* M]$ similar to the
bounds for M = N; likewise for $N < M < N \log^{(j)} N$, for any fixed j, and
$T \in [\log N, \log N\, \beta(N, M/N)]$.

Comment 4: Setting $i \ge \log^* N - \log^* \log^* N$ makes the ratio equal to 1. This result can be
extended to all values of M.
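
The functions in Comment 2 are easy to misread, so here is a small sketch of them as defined there, assuming base-2 logarithms and $y \ge 1$ (both are assumptions; the comment itself does not fix the base). The names `beta` and `log_star` are illustrative.

```python
import math

def beta(x, y):
    """Least i such that the i-fold iterated base-2 logarithm of x is <= y."""
    i = 0
    while x > y:          # log^(0) x is x itself
        x = math.log2(x)
        i += 1
    return i

def log_star(x):
    """log* x as defined in Comment 2: beta(x, 2)."""
    return beta(x, 2)
```

Note that this convention stops the iteration at 2 rather than at 1, so it can differ by one from other common definitions of $\log^*$.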
Dense Sorters: $AT$ Upper Bounds.

size | upper bound | ratio: upper bound / lower bound
--- | --- | ---
$\log N$ [...] | $N \log M\, (N \log N)^{1/2}$ † | 1
$N \le M \le N^2$ | see comment below |
[...] $M^{1/2} \log(2N/M)$ | $N M^{1/2}$ | 1

† This bound applies to ranking circuits for all T as stated above; for sorting circuits, it is
valid subject to the extra restriction that $T \ge [\ldots]$. For
$\max[(N \log N)^{1/2}, \log\log M] \le T < [\ldots]$ (i.e., for $[\ldots] < A < N \log M$), the correct
tradeoff for sorting circuits is given by $AT/\log A = N \log M$. (This latter tradeoff is due to
[Bilardi and Preparata, 1985b].)


Comment: For $N \le M \le N^2$, all the circuits, up to the minimum area circuit, are covered by
the $AT^2$ tradeoff.

Perimeter Sorters: $AT^2$ Upper Bounds.

size | upper bound | ratio: upper bound / lower bound
--- | --- | ---
$N^2 \le M$; $[\ldots] \le T \le f$, where $f = \max[\ldots]$ | $N^2 \log N \log M\, g$, where $g = \log\frac{\log M}{\log N}$ | $1 + [\ldots]$
$\log N \le T \le (N \log N)^{1/2}$ | $N^2 \log N \log [\ldots]$ | [...]
$\log N \le T \le (M \log M)^{1/2} \log [\ldots]$ | $N M \log M \log [\ldots]$ | [...] †‡
$\log N \le T \le f$, where $f = \min[M \log N, (M/\log M)^{1/2} \log^2 N]$ | $[\ldots] / \log(2T/\log N)$ | †‡

† The lower bounds in these cases are for a weaker model.

‡ This bound applies to ranking circuits for T indicated above, and to sorters for times
$\log N \le T \le [\ldots]$. For sorting circuits and larger times (i.e., for
$N \log N < A < N \log M$): $AT = O(\max[\ldots])$.

Remark 1: Note that A, T, and ratios of upper to lower bounds should all be understood as
given up to a constant factor; thus, 1 should be read as $O(1)$.

Remark  2:  For  MsiN^  we  believe  the  lower  bound  is  weak.  However,  we  can  build  a 
'verifier'  circuit  that  achieves  this  lower  bound.  That  is,  we  input  the  rank  of  each  item  as 
well  as  its  value,  in  a  where-  and  when-determinate  format,  and  the  circuit  verifies  these 
ranks.    This  suggests  it  may  be  hard  to  improve  the  lower  bound. 

Remark 3: The result for $M = N^{1+\alpha}$, $\alpha > 0$, is due to [Bilardi and Preparata 1984] and
[Leighton 1984].

Perimeter Sorters: $AT$ Upper Bounds.

size | upper bound | ratio: upper bound / lower bound
--- | --- | ---
$f \le T \le (N \log N)^{1/2} [\ldots]$, where $f = [\ldots]$ | $N \log M\, (N \log N)^{1/2}$ ‡ | $1 + [\ldots]$
$N \le M \le N^2$ | see comment below |
[...] | $[\ldots] (M \log M)^{1/2}$ | 1
$f \le T \le [\ldots]$, where $f = \min[M \log N, (M/\log M)^{1/2} \log^2 N]$ | $[\ldots] / \log(2T/\log N)$ | †‡

† The lower bounds in these cases are for a weaker model.

‡ These bounds apply to ranking circuits, not sorting circuits. For bounds on sorting circuits,
see the table of $AT^2$ results for perimeter sorters.

Comment: For $N \le M \le N^2$, all sorting circuits, up to the minimum area circuit, are covered
by the $AT^2$ tradeoff.

2. Aligned Merging

We illustrate the aligned merge by describing the perimeter sorter for M = N, $T = \log N$.

The basic strategy is to group the input numbers into $\log N$ sets of $N/\log N$ inputs each,
called portions. The part of the circuit that handles a portion of inputs is called a block. We
number the blocks 1 through $\log N$. The blocks are laid out in a vertical column. Each block
includes a circuit switching sorter running in time $O(\log N)$. The outputs of these sorters are
merged, using an aligned merge, described below. Each block is of size
$O(N/\log N) \times O(N/\log N)$, so the perimeter sorter has size $O(N) \times O(N/\log N)$.

To perform the merge we connect the blocks using $O(N/\log N)$ merging channels, each
of constant width. Each channel has one connection to each block. For each channel, for
each block, we wish to feed a packet (containing a small set of numbers), and to merge these
$\log N$ packets in $O(\log N)$ time, at which point the sort should be complete. To enable this,
we impose a uniformity criterion: each packet can contain at most $\log N$ numbers drawn from a
small range of values ($O(\log N)$). So the first task is to divide the (locally sorted) inputs in
each block into packets. We call the values separating the packets breakpoints.

The breakpoints are chosen to be the following $2N/\log N$ values (some of which may be
repeated): every $(\log N)$th item in the sorted order in each block, and the values $i \log N$,
$0 \le i \le (N-1)/\log N$. The breakpoints are broadcast to all the blocks, which can easily be
done in $O(\log N)$ time using the $O(N/\log N)$ channels. With each breakpoint we record two
numbers, the block number and the position. The block number is the number of the block in
which the breakpoint originated, or zero for the $N/\log N$ breakpoints given the values $i \log N$,
$1 \le i \le N/\log N$. The position is the rank of the breakpoint in the set of sorted items in the
block from which the breakpoint originated. It is also convenient to associate with each input
its block number and position.

In each block, we sort the inputs and breakpoints. (The sort key for input and
breakpoint data is the triple <value, block number, position>, which is ordered
lexicographically.) For each pair of adjacent breakpoints we encode the items between them.
Since two adjacent breakpoints are at most $\log N$ apart in value, and since there are at most
$\log N$ items between them, this encoding uses $O(\log N)$ bits. We term the encoded items a
packet. The packet is associated with the breakpoint preceding the items. (It is easy to form
the packets in $O(\log N)$ time.) It is important to use the coding that is optimal for the final
sorted order (i.e. $O(\log N)$ items in an $O(\log N)$ range). This precoding allows us to merge
packets with the same header both quickly and on the fly. We remark that such an encoding
is, in general, not optimal for the outputs of individual blocks.
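
The breakpoint selection and packet formation just described can be modeled in software. The sketch below is only a sequential illustration under stated assumptions: `step` stands in for $\log N$ (taken as the ceiling of the base-2 logarithm), packets are kept as plain lists rather than encoded bit strings, and the names `breakpoints_for` and `packets_for` are hypothetical. It lets one check the uniformity criterion directly: every packet holds at most `step` items whose values span less than `step`.

```python
import bisect
import math

def breakpoints_for(sorted_blocks, n):
    """Breakpoints as <value, block number, position> triples, sorted lexicographically."""
    step = max(1, math.ceil(math.log2(n)))
    # The values i * log N carry block number zero.
    points = [(i * step, 0, 0) for i in range((n - 1) // step + 1)]
    # Every (log N)-th item of each locally sorted block; blocks are numbered from 1.
    for b, block in enumerate(sorted_blocks, start=1):
        points += [(block[p], b, p) for p in range(step - 1, len(block), step)]
    return sorted(points), step

def packets_for(block, b, points):
    """Split one sorted block into packets keyed by the preceding breakpoint."""
    packets = {}
    for p, value in enumerate(block):
        # Rightmost breakpoint whose triple is <= the item's own triple.
        header = points[bisect.bisect_right(points, (value, b, p)) - 1]
        packets.setdefault(header, []).append(value)
    return packets
```

Breaking ties by block number and position is what keeps packets small even when a block contains many equal values.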

Next,  we  align  packets  with  equal  headers.  That  is,  in  each  block  we  sort  the  packets 
according to the breakpoint value (using a circuit switching sorter, running in time $O(\log N)$).
The packet formation and alignment is essentially a routing problem. We solve it by creating
$2N/\log N$ dummy packets with breakpoints as headers, and sorting the input numbers and
dummies,  thus  interlacing  the  data  and  packet  headers.  The  data  is  then  transferred  to  the 
packet  headers  in  an  encoded  format.  Alignment  occurs  after  another  lexicographical  sort, 
where  packets  are  counted  larger  than  input  numbers.  Packets  with  the  same  breakpoint  are 
thus  aligned  in  the  same  position  in  the  different  blocks. 

There  is  still  one  detail  preventing  us  from  merging  the  aligned  packets:  a  distributed  set 
S of aligned packets may contain up to $\log^2 N$ items ($\log N$ items from each of $\log N$ blocks).
Since it is easy to compute how many items are in S as well as the partial sums (if we record
with each packet its size), we can allocate an appropriate number of channels for merging the
items in S, and assign local packets a channel number so that $O(\log N)$ numbers will be
merged on any channel. (For each packet P we need to compute two numbers: the global
displacement, the number of channels allocated to packets from sets preceding S, and the
local displacement, the number of channels allocated to packets in S preceding P; a packet in
S precedes $P \in S$ if it is in a block with a smaller block number and it is not allocated the
same channel as P. Details are left to the reader.) We generate a corrected set of dummy
packet headers in each block, and perform a second alignment of the data (with "just" two
more circuit switching sortings) so that our uniformity criterion is now globally satisfied.
(The sorts proceed as follows. In each block we create a dummy header for each channel; we
sort the dummy headers together with the packets, the key being the channel number. Then
we copy each packet to the dummy having the same channel number; this dummy is adjacent
to the packet following the sort. We re-sort the data lexicographically, packets being counted
larger than dummies. Now each dummy is aligned at the correct channel, so the packets are
correctly  aligned.) 
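
Since the channel allocation details are left to the reader, the following sketch shows one simple rule that is consistent with the description; it is an assumption, not necessarily the rule the authors intend. Each aligned set receives $\lceil |S| / \text{capacity} \rceil$ channels, and a packet goes to the channel indexed by the partial sum of the sizes preceding it, divided by the capacity. If every packet holds at most `capacity` items, no channel receives `2 * capacity` or more. The name `allocate_channels` and its argument layout are hypothetical.

```python
from itertools import accumulate

def allocate_channels(aligned_sets, capacity):
    """aligned_sets[s][j] is the size of block j's packet in aligned set s
    (one entry per block). Returns (channel numbers per packet, channels used)."""
    assignment, global_disp = [], 0
    for sizes in aligned_sets:
        total = sum(sizes)
        channels = max(1, -(-total // capacity))      # ceiling, at least one channel
        starts = [0] + list(accumulate(sizes))[:-1]   # partial sums before each packet
        # Local displacement; the clamp only affects empty trailing packets.
        local = [min(start // capacity, channels - 1) for start in starts]
        assignment.append([global_disp + d for d in local])
        global_disp += channels                       # global displacement of the next set
    return assignment, global_disp
```

With `capacity` set to $\log N$, the total number of channels is at most $N/\log N$ plus the number of aligned sets, which stays within $O(N/\log N)$.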

Next,  we  merge  the  items  in  the  packets  allocated  to  one  channel,  on  the  fly,  in  time 
$O(\log N)$, obtaining a sorted packet, also of length $O(\log N)$. Each of these sorted packets is
placed,  essentially,  in  the  block  from  which  it  should  eventually  be  output. 

Actually, we may not be able to carry out this last step because the $|S|$ items associated
with the same breakpoint have not been completely merged, in general. However, there are
only two (adjacent) blocks to which each item may belong, since $|S| \le \log^2 N$, so each packet is
routed to one of the two blocks to which its items belong (if there is any doubt). We also
ensure that exactly $N/\log N$ items are placed in each block. (This may require splitting the
output of some channels into two packets.) The packets are decoded, and the items in each
block are sorted, once more. It turns out that every item is within $\log^2 N$ places of its correct
sorted  position  and  thus  it  is  easy  to  complete  the  sort. 

In  summary,  the  essence  of  the  method  is  as  follows.  Divide  the  inputs  into  portions 
and  sort  each  portion.  Obtain  a  global  sample  of  the  inputs,  called  breakpoints,  which 
partition  the  sorted  inputs  relatively  evenly.  Using  the  breakpoints,  partition  each  portion  of 
inputs  into  packets.  Align  packets  that  cover  the  same  range  and  merge  the  items  they 
contain.  To  carry  out  the  merge  efficiently,  it  is  vital  that  the  numbers  in  each  packet  be 
coded  using  the  method  that  is  most  compact  for  the  output. 
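
The summary above translates into a short sequential model. The sketch below is a software illustration only: it ignores encoding, channels, layout, and timing, assumes the inputs are integers in the range [0, N) with N the number of inputs, and uses the hypothetical name `aligned_merge_sort`. Its purpose is to show why aligning packets by breakpoint and merging each aligned set independently yields a globally sorted sequence.

```python
import bisect
import math
from heapq import merge

def aligned_merge_sort(values):
    """Sequential model of the aligned merge for integers in [0, len(values))."""
    n = len(values)
    if n < 2:
        return list(values)
    step = math.ceil(math.log2(n))        # plays the role of log N
    size = math.ceil(n / step)            # portion size, about N / log N
    blocks = [sorted(values[i:i + size]) for i in range(0, n, size)]

    # Breakpoints: multiples of step, plus every step-th item of each block.
    points = [(i * step, 0, 0) for i in range((n - 1) // step + 1)]
    for b, block in enumerate(blocks, start=1):
        points += [(block[p], b, p) for p in range(step - 1, len(block), step)]
    points.sort()

    # packets[h][b] holds the items of block b that follow breakpoint h.
    packets = [dict() for _ in points]
    for b, block in enumerate(blocks, start=1):
        for p, value in enumerate(block):
            h = bisect.bisect_right(points, (value, b, p)) - 1
            packets[h].setdefault(b, []).append(value)

    # Align packets with equal headers and merge them, one header at a time.
    output = []
    for aligned in packets:
        output.extend(merge(*aligned.values()))
    return output
```

Because the breakpoints partition the key space into consecutive intervals, concatenating the merged sets in breakpoint order gives the sorted output.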

The  merging  channel  is  trivial  to  construct,  but  nevertheless  should  be  sketched  in  more 
detail.  Partial  sums  are  easily  computed  on  a  binary  tree,  and  our  channel  is  in  fact,  an 
instance  of  a  tree.  Merges  are  done  systolically,  and  can  also  be  done  (using  distributed 
systolic queues) on a tree. In the case N = M, and $T = \log N$, a channel is simply a depth
$\log N$ tree with $\log N$ leaves, which is essentially a simple chain! This is why the channel has
width $O(1)$. In general, we must extrapolate this simple tree structure to accommodate
additional leaves (of depth $O(T)$) and account for their increased layout width. The nature of
such trees has been analyzed in [Yao, 1981]. In any case, our merging paths and merge
packets are always of length $O(T)$. The layout height of such trees, though, will in general
add  to  the  width  of  the  merging  circuit  and  hence  the  area  of  the  sorter. 

It  may  be  appropriate  to  comment  on  the  sorting  method  and  the  encoding  scheme.  If 
we consider a horizontal cut of the network we see that $O(N)$ bits cross it, as is required for
an optimal sorter. In the vertical direction however, we have a surprise: $O(N \log N)$ bits cross
a  vertical  cut.  Our  aligned  merge  uses  suboptimal  encodings  to  attain  an  optimal  sort!  We 
see  this  method  is  suitable  only  for  a  non-square  network.  Optimal  perimeter  sorters, 
however,  must  have  a  skewed  aspect  ratio,  so  there  is  no  inconsistency  in  our  construction. 
We  remark  that  a  minimum  area  dense  sorting  circuit  is  square,  and  thus  this  method  is  not 
by  itself  adequate  for  designing  optimal  dense  sorters. 

We extend the construction to other values of T, $\log N \le T \le N^{1/2}$. The construction is
essentially the same. In particular, the number of blocks remains at $\log N$, but the running
time of each phase is increased to $O(T)$. The packet size is also increased to $O(T)$, while both
the number of channels and the number of breakpoints are reduced to $O(N/T)$. The area used
by each circuit switching sorter is reduced to $O(N^2/T^2)$. It is easy to see that our construction
requires no substantial changes. We deduce

Lemma 1: For M = N, and $\log N \le T \le N^{1/2}$, there are perimeter sorters with
$AT^2 = \Theta(N^2 \log N)$.
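
As a rough consistency check of the upper bound in Lemma 1, the area can be tallied from the figures quoted above. This is a sketch that counts only the circuit switching sorters, assuming the merging channels do not dominate the area:

```latex
\begin{align*}
A &= \underbrace{\log N}_{\text{blocks}} \cdot O\!\left(\frac{N^2}{T^2}\right)
   = O\!\left(\frac{N^2 \log N}{T^2}\right), \\
AT^2 &= O(N^2 \log N).
\end{align*}
```

For $T = \log N$ this gives $A = O(N^2/\log N)$, which agrees with the $O(N) \times O(N/\log N)$ layout described earlier.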

Extending this construction to larger values of M, $N \le M \le N^2$, is straightforward.

Lemma 2: For $N \le M \le N^2$, and $\log N \le T \le (N \log(2M/N))^{1/2}$, there are perimeter sorters with
$AT^2 = O(N^2 \log N \log(2M/N))$.

Proof: The construction is much as above, with a few parameters changed. In particular, we
use $\frac{[\ldots]}{\log(2M/N)}$ blocks. The only part that needs a little care is the systolic merging (in the
aligned merge). □

We complete our study of $AT^2$ tradeoffs for perimeter sorters with $N \le M \le N^2$ in section
4.

3.  Dense  Sorters 

Recall  that  [Thompson,  1979]  establishes  the  bound  AT^=  C1(N^)  for  dense  sorters  of  N 
numbers  in  the  range  [0,N].  We  now  show  that  this  bound  is  tight. 

Loosely speaking, the dilemma in achieving an optimal sorting circuit is as follows. If a
circuit is to sort optimally for $M = N$, then the total flow of information across a line
partitioning the circuit into halves must be $O(N)$ bits [Thompson, 1979]. Moreover, the $O(N)$
bits must, essentially, describe all the numbers that comprise the input data. This difficulty,
on first thought, might not seem too serious, since $N$ numbers in the range $[0,N]$ have an
information content of about $2N$ bits. However, before $N$ numbers can be compressed into such
a minimum representation, they have to be sorted (or the equivalent). So there is a paradoxical
obstacle that must be overcome: to sort the numbers they must be compressed; to compress
them they must be sorted. Furthermore, there are two difficulties intrinsic to our divide and
conquer approach. First, each level presumably requires $\log N$ time. Second, the sum of the
information content of the parts is critically larger than the information content of the whole.
As we shall see, this suggests that an $O(\log N)$ time, $AT^2 = N^2$ sorter is unattainable.
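
The information-content remark above can be checked numerically. The following is a minimal sketch, not part of the original construction: it assumes the sorted output is modelled as a multiset of $n$ values drawn from $m$ distinct values (taking $m = n$ as a stand-in for the range $[0,N]$), and the function names are illustrative only.

```python
import math


def multiset_bits(n: int, m: int) -> float:
    """Bits needed to name one multiset of n values drawn from m distinct values.

    There are C(n + m - 1, n) such multisets; a sorted list carries exactly
    this much information, because its order is no longer free.
    """
    return math.log2(math.comb(n + m - 1, n))


def unsorted_bits(n: int, m: int) -> float:
    """Bits needed to write the same n values as an ordered sequence."""
    return n * math.log2(m)


if __name__ == "__main__":
    for n in (16, 256, 4096):
        per_sorted = multiset_bits(n, n) / n
        per_unsorted = unsorted_bits(n, n) / n
        print(f"n={n}: sorted {per_sorted:.3f} bits/number, unsorted {per_unsorted:.3f}")
```

The sorted form stays just under 2 bits per number while the unsorted form costs $\log n$ bits per number, which is the gap the compression step has to close.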

The first step in our strategy is to group the inputs into $\log^2 N$ sets of $N/\log^2 N$
inputs each, called portions. Again, each portion of inputs is handled by a block of the
circuit. These blocks are laid out as a $\log N \times \log N$ array. Each block includes a
circuit switching sorter running in time $O(T)$. The outputs of these sorters are then merged,
as described below. Each sorter takes area $O(N^2/(T^2 \log^2 N))$, so the total area used by
these sorters is $O(N^2/T^2)$. Thus, if we could merge the output of $\log^2 N$ blocks in
$\log N$ time, and use $N^2/\log^2 N$ area for the merging network, we would have
$AT^2 = O(N^2)$. We do not, however, know how to accomplish such an optimal $\log N$ time merge;
in fact, we conjecture that no such circuit exists. Instead, we use somewhat smaller and slower
blocks to reduce the total area so that $AT^2 = O(N^2)$ for $T = \log N \log^* N$.

The choice of $O(\log N \log^* N)$ time is motivated by the necessities of data compression at
various stages of the sort. To be specific, let us imagine we are aiming for an $O(\log N)$
sorting time. In $\log N$ time we can transmit up to $N/\log N$ bits across the boundary of a
block. In this time we wish to transmit (in encoded form) all $N/\log^2 N$ numbers input into
this block. A little thought shows that any encoding of these $N/\log^2 N$ numbers needs at
least $\log\log N$ bits per number, on the average. Suppose we merge sets of outputs from the
blocks, without encoding them any more densely. Once we have combined the outputs of
$(\log N/\log\log N)^2$ of these blocks (arranged in a square shape), $\Theta(\log N)$ time will
be required to transmit the merged ensemble of bits across the boundary of this square. If we
try to do any further merging without recoding, $O(\log N)$ time will be insufficient for
transmitting the merged set of bits across the boundary of the square merging region. (Notice
that a region with a skewed aspect ratio (and hence larger perimeter) will not improve the data
throughput, since the internal merge of data will be limited by the shorter dimension.) Thus it
seems we are obliged to recode the inputs at this point. The merging region has data
representing $N/(\log\log N)^2$ inputs, and a perimeter of length $N/(\log N \log\log N)$. By
iterating the argument, we deduce there must be at least $\log^* N$ recoding stages in the
sorter. These recoding levels form the basis of our construction. This contrasts with the
perimeter sorter where we are able to have just one recoding. (The above discussion is intended
to motivate our constructions; it is not likely to provide a lower bound proof, since recoding
can be done on the fly in constant time. Our discussion also illustrates some, though not all,
of the difficulties that must be overcome in designing an optimal sorter for this number range.
We believe that the true obstacle to an optimal $\log N$-time sorter is the data rearrangement
inherent to merging.)
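
The counting in the preceding paragraph can be replayed with a short sketch. It is only an illustration under stated assumptions: logarithms are taken base 2, constant factors are dropped, and the helper names `log_star` and `merge_region_time` are hypothetical, not taken from the paper.

```python
import math


def log_star(n: float) -> int:
    """Number of times log2 must be applied to n before the value drops to <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


def merge_region_time(n: float) -> float:
    """Time to push an unrecoded merged region across its own boundary.

    Uses the counts quoted in the text: the region holds N/(loglog N)^2
    numbers at about loglog N bits each, and its perimeter is
    N/(log N * loglog N). Constant factors are ignored.
    """
    lg = math.log2(n)
    lglg = math.log2(lg)
    bits = (n / lglg ** 2) * lglg
    perimeter = n / (lg * lglg)
    return bits / perimeter


if __name__ == "__main__":
    n = 2 ** 16
    print("log* N =", log_star(n))
    print("boundary transfer time ~", merge_region_time(n), "vs log N =", math.log2(n))
```

Under these assumptions the transfer time of the first merged region comes out to exactly $\log N$, so any further unrecoded merging would overrun an $O(\log N)$ budget, and the number of times the argument can be repeated is governed by $\log^* N$.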

There are a few more properties that can be anticipated about our next construction. If, for
$M = N$, there is an $AT^2$-optimal sorter based on merging, then the final merging network
must run in time $\Theta(T)$, since it must transmit $\Theta(N)$ bits across a cut of length
$\Theta(N/T)$. On the other hand, if we have $\log^* N$ levels of merging networks, and if they
cannot be pipelined, then at least half of them must run in time $O(\log N)$ to achieve a sort
in $T = \log N \log^* N$. Thus we adopt a two-type merging structure; the first type will
comprise $\log^* N$ levels, each running in $\log N$ time, and the second type will contain a
hierarchy with running times interpolating between $\log N$ and $\log N \log^* N$. Moreover,
the final merging network will consume area $\Theta(N^2/T^2)$; but each level of the type 1
merging networks ought to consume area $O(N^2/(T^2 \log^* N))$ (or better still, have side
length $O(N/(T \log^* N))$) to remain within the area bound. Since each level of our type one
merging network is $\log^* N$ faster and smaller than our complete sorter, the parameters
stating just how many inputs can be absorbed per level (i.e. without recoding) have to be
recomputed. There will still be only $O(\log^* N)$ levels, however. Achieving all these
features simultaneously is the major challenge of the construction.

Our sorter turns out to use $\log^* N$ levels of a merging network that, at level $i$, merge the
outputs from about


Iog(/-i)jV 


level    /—I    networks.     Each    level    runs    in    time 
\og<-'^N(log'N)^ 

0(logN).    This  method  breaks  down  at  a  level  comprising  approximately  (log*//)*  networks. 

The merging is completed with, effectively, an H-tree of binary merging devices, whose
$\log\log^* N$ levels have running times that form a geometric series. Careful tuning is needed
to keep the area within bounds. The details can be found in appendix B. We can now deduce

Lemma 3: For $M = N$, $T = \log N \log^* N$, there exist VLSI sorters with $AT^2 = \Theta(N^2)$.

This  result  can  be  extended  to  larger  values  of  T  in  many  ways.  For,  say,  T^log^'*^, 
we  can  simply  proceed  as  before,  but  rescale  the  number  of  bits  per  wire  up  by  a 
TAogNlog'N      factor,      and      rescale      the      wire      count      down      correspondingly.       For 

\og^"^N^T-^{N/\ogNy'^,  it  is  simpler  to  use  an  H-tree  exclusively.    There  are  — loglog// 

4 
stages,  where  stage  k  has  2^T/\og^'*N  bits  per  wire.  A  stage  comprises  two  levels  of  a  binary 
tree,  so  that  the  merges  are  all  pairwise.  With  this  organization,  the  task  of  going  from  a 
nearsort to a perfect sort is quite simple, even for large $T$. We choose the blocks to take
$N/\log^2 N$ inputs (they are essentially circuit switching sorters, except that they produce
their output in encoded form). It is clear this yields $AT^2 = \Theta(N^2)$ for
$\log N \le T \le (N/\log N)^{1/2}$. Since $T = (N/\log N)^{1/2}$ is the slowest the circuit
switching sorters can run, for $N/\log^2 N$ inputs, a different sorting strategy is required for
larger $T$.

Our construction, for $T = N^{1/2}$, uses the following merging network (type 3), indexed by
the parameter $x$. The network is a box configured around a $4 \times 4$ array of units. Each
unit produces, as output, the encoded form of $N/(64x^2)$ inputs. The merging network
accumulates 4 waves of outputs from these units, and produces these $N/x^2$ input values in
encoded form as
its  output.  The  outputs  from  the  merging  network  leave  on  N^^\ogx/x^^  wires.  The  increase
in  side  length  due  to  the  merging  network  is  0(N^^logx/x'^^),  so  an  *xx  array  of  networks 
incurs  a  side  length  increase  of  0(^''^logx/x''^).  The  running  time  of  the  merging  network  is 
the  maximum  of  OiN^'^/x^^),  and  the  time  for  the  four  waves  of  inputs  to  be  output  by  the 
units.  (We  remark  that  the  latter  term  turns  out  to  be  the  larger.)  The  merging  network  is 
described  in  appendix  C. 

We build $\frac{1}{2}\log\log N$ levels of the type 3 merging networks. The units at the bottom
level of our construction are essentially circuit switching sorters that take $N/\log^3 N$ inputs
(except  that  the  output  appears  in  encoded  form  on  //''^loglog/Z/log^'^iV  wires).  Thus,  there 
are $\log^2 N$ circuit switching sorters at the bottom level that process $\log N$ waves of
$N/\log^3 N$ inputs. At level $i+1$, the units are the boxes at level $i$. Our construction
uses area $O(N)$ (the increases in side length due to the levels of merging networks form a
modified geometric series).

The sorters at the bottom level run in time $O(N^{1/2}/\log N)$ per wave. A merging network
$k$ levels up requires $4^k$ waves of inputs to the sorters, and thus runs in time
$O(4^k N^{1/2}/\log N)$. We observe that as a merging network is outputting one wave it can be
processing the next wave. Thus, the running time of the whole system is $O(N^{1/2})$. We deduce

Lemma 4: For $M = N$, and $T = N^{1/2}$, there exist VLSI sorters with $AT^2 = \Theta(N^2)$.
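
The bookkeeping behind Lemma 4 can be replayed with a small sketch. It assumes base-2 logarithms, exactly $\frac{1}{2}\log\log N$ levels, a $4 \times 4$ array of units and 4 waves per level as described above, and it drops all constant factors; the function name and dictionary keys are illustrative only.

```python
import math


def type3_bookkeeping(n: int) -> dict:
    """Count sorters, waves and per-level running times for the type 3 hierarchy."""
    lg = math.log2(n)
    levels = round(0.5 * math.log2(lg))      # (1/2) log log N levels
    sorters = 16 ** levels                   # each level is a 4 x 4 array of units
    waves = 4 ** levels                      # each level accumulates 4 waves
    inputs_per_wave = n / (sorters * waves)  # inputs read by one sorter per wave
    bottom_time = math.sqrt(n) / lg          # O(N^(1/2) / log N) per wave
    level_times = [4 ** k * bottom_time for k in range(levels + 1)]
    return {
        "levels": levels,
        "sorters": sorters,
        "waves": waves,
        "inputs_per_wave": inputs_per_wave,
        "level_times": level_times,
    }


if __name__ == "__main__":
    info = type3_bookkeeping(2 ** 16)
    for key, value in info.items():
        print(key, value)
```

For $N = 2^{16}$ the sketch gives $\log^2 N$ sorters, $\log N$ waves, and a top-level time of $N^{1/2}$. Because successive levels overlap, the top level's time dominates the geometric progression, which is the $O(N^{1/2})$ total claimed above.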

It is fairly straightforward to construct circuits for the remaining values of $T$,
$(N/\log N)^{1/2} \le T \le N^{1/2}$. We use the strategy just described, with two changes. The
$\log^2 N$ units at the bottom level of the construction are circuit switching sorters which
take $N^2/(T^2 \log^3 N)$ inputs per wave, and run in time $\Theta(N/(T \log N))$ per wave. We
switch to an H-tree after $\frac{1}{2}\log(T^2 \log N / N)$ levels of the type 3 merging
networks, that is, when all of the inputs have been read by the circuit. (We have to choose
type 3 mergers of an appropriate size and speed. Since the idea is exactly the same as in the
special case presented above, the details are left to the reader.) We obtain

Lemma 5: For $M = N$, and $\log N \log^* N \le T \le N^{1/2}$, there exist VLSI sorters with
$AT^2 = \Theta(N^2)$.

It is not difficult to extend these results to number ranges where $N \le M \le N^2$. For the
time range $\log N \log^* N \le T \le N^{1/2}$, we could, for example, simply replace each wire
by a bus of $\log(2M/N)$ wires. For very slow times such as $T = (N\log(2M/N))^{1/2}$, it
suffices to increase the wire (and packet) count for each stage by a factor of
$\log^{1/2}(2M/N)$. All packet lengths (and merging times) are also increased by the same
factor. It is convenient (though not necessary)
to  increase  the  number  of  inputs  (per  wave)  to  each  circuit  switching  sorter  by  a  factor  of 
log^2M/N  and  to  decrease  the  number  of  sorters  by  log^2M/N.  The  numbers  of  input  waves 
and merging stages are adjusted accordingly. (We remark that the circuit area when
$T = (N\log(2M/N))^{1/2}$ is proportional to the information content of the input, and no
further increase in time, therefore, is useful [Siegel 1985].) For the very fast times when
logN^{N,2M/N)^T<logN\og'N,  the  construction  is  analogous  to  the  case  M  =  N  for  small  T: 
the  principal  modifications  are  that  the  number  of  circuit  switching  sorters  is  reduced  by  a 
factor  of  log^2M/N  (with  a  corresponding  increase  in  the  number  of  inputs),  that  log'iV  is 
replaced  by  ^(N,2M/N)  in  the  parameterized  constructions,  and  that  the  wire  counts  (and 
numbers  of  packets)  are  adjusted  to  accommodate  the  increased  information  flow.  Here 
3(j^>)')  —  ™iQ  i'Aog'-'^x^y.    As  a  consequence,  we  have. 

Lemma 6: For $N \le M \le N^2$, and $\log N\,\beta(N, 2M/N) \le T \le (N\log(2M/N))^{1/2}$,
there exist VLSI sorters with $AT^2 = \Theta(N^2 \log^2(2M/N))$.

4. Perimeter Sorters: $N \le M \le N^2$

We complete the study of $AT^2$ tradeoffs for $M \in [N, N^2]$ by constructing optimal
perimeter sorters for large times, up to $T = (N\log N)^{1/2}$, at which point we obtain minimum
area circuits. We start with the case $M = N$. Since constructions for $T \le N^{1/2}$ are given
in section 2, it suffices to suppose $T = kN^{1/2}$, where $1 \le k \le (\log N)^{1/2}$. The
circuit uses $\log N$ blocks, configured as a vertical column. Each block comprises,
essentially, a circuit switching sorter that sorts $k^2$ waves of $N/(k^2 \log N)$ inputs, at
the rate of one wave per $O(N^{1/2}/k)$ timesteps. We use aligned merging networks to combine,
for each wave, the encoded outputs from $\log N/k^2$ consecutive blocks. Thus there are $k^2$
alignment structures, each of which has $N/k^4$ inputs per wave. The structures run in time
$O(N^{1/2}/k)$ per wave of inputs, have width $O(N^{1/2}/k)$, and length
$O(N^{1/2}\log N/k^3)$. Altogether, they have total length $O(N^{1/2}\log N/k)$ and width
$O(N^{1/2}/k)$. We complete the sort with a $(2\log\log k)$-level tree of type 6 merging
networks (described in appendix C).

The  type  6  merging  networks  are  interconnected  in  a  sideways  binary  tree  organization 
that  has  one  network  unit  per  node.  Nodes  belonging  to  the  same  level  are  located  along 
vertical  columns.  A  merging  wave,  for  a  network  unit,  uses  the  outputs  from  two  waves  of 
its  two  children  at  the  next  lower  level.  Thus  the  packet  length  and  running  time  per  wave 
double  at  successive  levels  of  the  tree.  The  wire  count  is  adjusted  accordingly.  For  our 
current  application  (where  M>N)  a  /-th  level  merging  network  has  width  proportional  to  its 
(//i^2'(41ogJfc  -  2/)/Jfc^)-wire  bus,  and  has  length  0(N^^2'logN/k^).  Its  running  time  and 
packet  length  are  OiN^^2'/k). 

The  type  6  merging  networks  at  the  bottom  level  read  the  outputs  from  the  structures 
which  encode  N/k^  inputs  per  wave,  described  above.  At  this  bottom  level,  strings  of 
0{Jc^/\ogk)  consecutive  packets  from  each  structure  are  merged  (i.e.  concatenated)  together. 
No  sorting  is  needed,  since  the  aligned  merge  gives  a  perfect  sort  for  the  contents  of  each 
structure. The widths of the merging networks form a modified geometric series, the last
term and the total being $O(N^{1/2}/k)$. The running times of the merging networks can be
overlapped (that is, while the mergers at one stage are working on the $j$th wave of inputs,
the mergers at the next stage down are working on the $(j+1)$st wave). Thus the total running
time for the networks is $O(kN^{1/2})$. We obtain

Lemma 7: For $M = N$, and $\log N \le T \le (N\log N)^{1/2}$, there exist VLSI perimeter
sorters with $AT^2 = \Theta(N^2 \log N)$.

Proof: The structures process all the waves of input in time $O(kN^{1/2}) = O(T)$. The
structures together have length $O((N^{1/2}\log N)/k)$, which is therefore the length of the
sorter. The structures have width $O(N^{1/2}/k)$, as do the type 6 mergers. Hence
$AT^2 = \Theta(N^2 \log N)$. □
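
The arithmetic in this proof can be checked with a short sketch. It assumes the parameters quoted in the construction ($\log N$ blocks, $k^2$ waves of $N/(k^2\log N)$ inputs, $T = kN^{1/2}$, length $N^{1/2}\log N/k$, width $N^{1/2}/k$), base-2 logarithms, and all hidden constants equal to 1; the function name is hypothetical.

```python
import math


def perimeter_parameters(n: int, k: float) -> dict:
    """Evaluate the Lemma 7 layout parameters for M = N with T = k * sqrt(N)."""
    lg = math.log2(n)
    blocks = lg
    waves = k ** 2
    inputs_per_wave = n / (k ** 2 * lg)
    t = k * math.sqrt(n)
    length = math.sqrt(n) * lg / k
    width = math.sqrt(n) / k
    area = length * width
    return {
        "total_inputs": blocks * waves * inputs_per_wave,
        "T": t,
        "area": area,
        "AT2": area * t * t,
    }


if __name__ == "__main__":
    n, k = 2 ** 16, 2
    params = perimeter_parameters(n, k)
    print(params)
    print("N^2 log N =", n * n * math.log2(n))
```

The factor $k$ cancels in $AT^2$, so the product stays at $N^2 \log N$ across the whole range $1 \le k \le (\log N)^{1/2}$.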

We extend the result to $N \le M \le N^2$. The idea is to replace $\log N$ by
$\log N/\log(2M/N)$. We have obtained the $AT^2$ tradeoff for $T \le (N\log(2M/N))^{1/2}$. So
let $T = k(N\log(2M/N))^{1/2}$, $1 \le k \le (\log N/\log(2M/N))^{1/2}$. We use blocks as
before: each one takes $N/(k^2 \log N/\log(2M/N))$ inputs. We merge the inputs to
$\log N/(k^2 \log(2M/N))$ blocks, using an aligned merge. This yields structures that have
$N/k^4$ inputs. We use $k^2$ of these structures, receiving their inputs in $k^2$ waves. As
above, we combine the waves of inputs to these structures using the type 6 merging networks.
We obtain

Lemma 8: For $N \le M \le N^2$, and $\log N \le T \le (N\log N)^{1/2}$, there exist VLSI
perimeter sorters with $AT^2 = \Theta(N^2 \log N \log(2M/N))$.

5. Dense Sorters: $1 \le M \le N$

We start by obtaining an $AT^2$ tradeoff. We will use type 4 merging networks,
parameterized by $x$. This network is a box configured around an $8 \times 8$ array of units,
each of which uses $O(M\log(2x/M))$ bits to output $x \ge M$ values in encoded form. The
merging network produces as output its $64x$ inputs in encoded form, using $O(M\log(128x/M))$
bits. Let $p = \frac{1}{6}\log(2x/M)$. A level $p$ merging network produces its output on
$\frac{M}{T}4^{p+1}$ wires. The increase in side length due to one instance of a $p$-th level
merging network is $O(M \cdot 4^{p+1}/T)$. We note that for $x \ge 64M$, the output of our
merging network has (a maximum) information content bounded by twice that of a single input
unit. Evidently the wire count of a $p$-th level merging network is 4 times that of a
$(p-1)$-th stage, so as long as a packet header needs only a nominal portion of the packet
length for an output wire, this merging network has the property that an output packet has a
bit length at most half the length of a unit's packet. Also, the wire count will never exceed
$x$. Consequently, our network can merge in time $O(\max[T \cdot (1/2)^{p}, \log x])$. (As a
matter of courtesy to the reader, we include both terms when computing the total merging time,
if it is not immediately clear which term is dominant.) We show how to construct these merging
networks (type 4) in appendix C.

Case 1: $\log^2 M \le T \le M^{1/2}$.

Case 1a: $N \le M^2$. We build a hierarchy of the type 4 merging networks, starting with an
$(N/M)^{1/2} \times (N/M)^{1/2}$ array of units, each of which is a dense sorter for $M$ numbers
and runs in time $O(T)$. The area used by these units is $O(N/M \cdot M^2/T^2)$. The hierarchy
uses $\frac{1}{6}\log(2N/M)$ levels of type 4 merging networks. The time taken to sort is
$O(T + T/2 + \ldots) + O(\log N \log(2N/M)) = O(T)$. The additional side length needed for the
merging networks forms a geometric series of the form
$O((NM/T^2)^{1/2} + \frac{1}{2}(NM/T^2)^{1/2} + \ldots)$. So the total area used is
$O(NM/T^2)$.

Case 1b: $M^2 \le N$ and $\max[\log^2 M, \log N] \le T \le M^{1/2}$. We start with an
$N^{1/2}/M \times N^{1/2}/M$ array of units, each of which is an $M^2$ input, $T$-time optimal
dense sorter, as constructed in case 1a. Each such unit is an $M^{3/2}/T \times M^{3/2}/T$
square and has $M^{3/2}/T \ge M$ output wires. Our encoding scheme, for the output of each unit
of $M^2$ input numbers, is modified to produce $M$ counts, each of $2\log M$ bits: the count on
wire $j$ is just the number of instances of the number $j$ among the unit's inputs. The units
are interconnected in an ($M$-multiple) H-tree of $N/M^2$ unit-leaves. Each internal tree node
is a systolic ripple adder. The additional area used by the H-tree is $O(N)$. The time used by
the H-tree is $O(\log N)$, giving a total sorting time of $O(\log N + T)$. We obtain

Lemma 9: For $\max[\log^2 M, \log N] \le T \le M^{1/2}$, and $M \le N$, there exist VLSI sorters
with $AT^2 = \Theta(MN)$.
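
The count encoding of case 1b can be illustrated in software. The sketch below is only a functional analogue under stated assumptions: each unit is modelled as a slice of the input, the H-tree of adders as pairwise vector addition, and the names `unit_counts`, `tree_sum`, `expand` and `count_sort` are hypothetical; nothing here models layout, wires, or timing.

```python
from collections import Counter
from typing import List


def unit_counts(values: List[int], m: int) -> List[int]:
    """Encode one unit's inputs as m counts: entry j is the number of copies of j."""
    tally = Counter(values)
    return [tally[j] for j in range(m)]


def tree_sum(count_vectors: List[List[int]]) -> List[int]:
    """Combine count vectors pairwise, level by level, like a binary tree of adders."""
    layer = count_vectors
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer), 2):
            if i + 1 < len(layer):
                next_layer.append([a + b for a, b in zip(layer[i], layer[i + 1])])
            else:
                next_layer.append(layer[i])  # odd node passes straight up
        layer = next_layer
    return layer[0]


def expand(counts: List[int]) -> List[int]:
    """Turn the final counts back into an explicit sorted list."""
    return [j for j, c in enumerate(counts) for _ in range(c)]


def count_sort(values: List[int], m: int, unit_size: int) -> List[int]:
    """Sort values in range(m) by per-unit counting followed by a tree of additions."""
    if not values:
        return []
    units = [values[i:i + unit_size] for i in range(0, len(values), unit_size)]
    return expand(tree_sum([unit_counts(u, m) for u in units]))


if __name__ == "__main__":
    data = [3, 1, 0, 3, 2, 1, 1, 0, 2]
    print(count_sort(data, m=4, unit_size=4))
```

The point of the encoding is that the tree only ever carries $M$ counts per node, independent of how many inputs lie below it, which is why the H-tree stays small once $M^2 \le N$.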

Case 2: $\max[\log N, \log M \log^* M] \le T \le \log^2 M$. We have to proceed more delicately,
as was the case for $M = N$ and $T = \log N \log^* N$. Note that $M$ cannot be too small:
$M \ge 2^{(\log N)^{1/2}}$. The construction begins as in case 1a, with $N/M$ units, each of
which is an optimal $T$-time dense sorter of $M$ numbers. We then use
$\min[\log_{64}(N/M), \log\log^* M]$ levels of the type 4 merging network. The sorting will be
finished if $N/M \le (\log^* M)^6$. Otherwise, each merging network, at the top level, will
have $M(\log^* M)^6$ inputs, and output on $M(\log^* M)^2/T$ wires, with at most
$O(T/\log^* M)$ bits per wire. (For $T = \log M \log^* M$ this is $O(\log M)$ bits per wire.)
Then another network hierarchy, type 5, is used for further merging, and finally, an H-tree is
used, if necessary, to complete the sort as in case 1b above.

The  type  5  merging  network  is  parameterized  by  x  where  x^(log*3f)^,  and 
2x/(iog-Ao':s7-log«jV/.  It  is  a  box  configured  around  an  array  of  (2^('°8''^Vx)  x  (2^('°8''^Vx) 
units  that  each  inputs  data  representing  Mx'  values,  and  that  outputs  these  numbers  in 
encoded  form  (using  0(A/logx)  bits).  The  outputs  from  each  row  of  units  in  the  array  is 
align  merged  (as  described  in  appendix  C)  and  then  the  row  outputs  are  merged  (by  packet 
sorting  rather  than  alignment).  The  resulting  merge  outputs  its  mV^'-^°^''^^  inputs  in  encoded 
form  using  0(Afx/(log*A/)2)  bits.  It  runs  in  time  max[r/log*A/,2^<'°8'A^Vx]  =  TAog'M,  and 
has  Mx/(T\og'M)  output  wires.    The  increase  in  side  length  due  to  the  row  merging  network 

Af  2^('°8*'*^^' 


is  0\ ; .    Since  there  are  0(///Af  4*'('°8"^^  merging  arrays  of  the  networks  at  this 

level,  the  total  side  length  increase  is  0(iV/3/4*^('°8''^V^    ^^"^""^ =  0(  (^^)'^)     The 

[     riog'Af     J  T\og'M 

merge  of  the  rows  contributes  the  same  increase  in  side  length.    We  show  how  to  construct 
this  merging  network  (type  5)  in  appendix  C. 

We  use  up  to  log'Af  -  log*log*A/  levels  of  the  type  5  merging  network  so  that  at  the  top 
level  the  number  of  inputs  to  a  merging  network  is  between  min[//,Af  (riog*A/)^]  and 
min[//,A/(log(riog*W)(log*Af)^)^].  These  networks  induce  a  total  side  length  increase  of 
0{{MN/T^y^),  and  the  total  running  time  for  the  hierarchy  is  0(T). 

It  is  possible,  however,  that  further  merging  is  still  required.  By  adding  one  extra  stage 
of  a  (possibly  subsized)  type  5  merging  network,  if  necessary,  we  may  suppose  that  at  the  top 
level  of  our  hierarchy,  2-^^'°«''^^=  Tlog'M.    In  this  case,  each  top  level  network  has  a  side 

w2x/(log'A0^ 

length   of   at   least    -^^^-= ^M.     Then   the   merge   is   completed    by   switching   to   the 

riog'Af 
encoding  described  in  case  lb,  and  by  using  an  Af -multiple  H-tree.    We  obtain 

Lemma 10: For $\max[\log N, \log M \log^* M] \le T \le M^{1/2}$, and $\log^2 N \le M \le N$,
there exist VLSI sorters with $AT^2 = \Theta(MN)$.

Next, we obtain an $AT$ tradeoff for larger values of $T$. First, we describe the construction
for $\max[\log N, M^{1/2}] \le T \le \min[N/M^{3/2}, N/(M^{1/2}\log(2N/M))]$. The upper bound on
$T$ has a natural interpretation. When $T = N/M^{3/2}$, the inputs are pipelined into a single
$AT^2$-optimal sorter and the output is processed further. For faster times, we use several
such sorters in parallel. The $N/(M^{1/2}\log(2N/M))$ time bound occurs for the smallest
circuit possible.

Case 1: $\max[\log N, M^{1/2}] \le T \le \min[N/M^{3/2}, N/(M^{1/2}\log(2N/M))]$. The first
stage of our constructions uses optimal sorters (described in Lemma 10) to sort sets of $M^2$
numbers in time $O(M^{1/2})$, and area $M^2$. We use $N/(TM^{3/2}) \ge 1$ of these sorters. We
refer to the inputs that arrive in $O(M^{1/2})$ time as a wave of inputs. There are
$T/M^{1/2}$ waves of inputs. The sorters are connected by an $M$-multiple H-tree. The area for
such an $N/(TM^{3/2})$-leaf tree is $NM^{1/2}/T$. Each H-tree is used to count the number of
occurrences of one input value. An internal node comprises a single bit full adder plus a few
bits of storage and control. This construction clearly has $AT = O(NM^{1/2})$. Now, we need
$T \ge \log N$ as always, and $T \ge M^{1/2}$, which is proportional to the sorting time of our
basic sorter. In addition, we need $\Theta(M\log N)$ space at the top level of the tree for the
counts. Actually, we need only guarantee $\Theta(M\log(2N/M))$ space, since $M \le N^{1/2}$.
Sufficient area, in this case, is implicit in the $AT = \Theta(NM^{1/2})$ tradeoff, since
$T \le N/(M^{1/2}\log(2N/M))$.
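
The internal node described above, a single-bit full adder with a little storage, adds two counts bit-serially. The following sketch shows that behaviour in software, assuming counts are streamed least significant bit first and that the only state kept between steps is the carry; the helper names are illustrative and not from the paper.

```python
from itertools import zip_longest
from typing import List, Tuple


def full_adder(a: int, b: int, carry: int) -> Tuple[int, int]:
    """One-bit full adder: returns (sum bit, carry out)."""
    total = a ^ b ^ carry
    carry_out = (a & b) | (carry & (a ^ b))
    return total, carry_out


def serial_add(a_bits: List[int], b_bits: List[int]) -> List[int]:
    """Add two numbers given as bit lists, least significant bit first.

    One full adder is reused on every step; the carry is the only stored state.
    """
    carry = 0
    out = []
    for a, b in zip_longest(a_bits, b_bits, fillvalue=0):
        bit, carry = full_adder(a, b, carry)
        out.append(bit)
    out.append(carry)  # final carry becomes the top bit
    return out


def to_bits(x: int, width: int) -> List[int]:
    """Little-endian bit list of x using exactly `width` bits."""
    return [(x >> i) & 1 for i in range(width)]


def from_bits(bits: List[int]) -> int:
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))


if __name__ == "__main__":
    total = serial_add(to_bits(13, 4), to_bits(11, 4))
    print(total, "=", from_bits(total))
```

A node built this way uses constant area regardless of count width, at the price of taking one timestep per bit, which fits a regime where time is plentiful and area is the scarce resource.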

Case 2: $\max[M^{1/2}, N/M^{3/2}] \le T \le N/(M^{1/2}\log(2N/M))$. Let
$T = N/(xM^{1/2}\log(2N/M))$, and $A = xM\log(2N/M)$. At the lowest level, we use one
$AT^2$-optimal sorter described in Lemma 10. It has area $A$, and sorts $A$ numbers. It runs in
time $O(M^{1/2})$. There are $N/(xM\log(2N/M))$ waves of inputs. The sorter outputs its data,
for each wave, in a slightly suboptimal encoding. The optimal encoding would represent, of
course, $A$ numbers in the range $[0, M-1]$. Instead, we use the encoding for $A\log(2N/M)$
numbers. The total number of bits used is still $O(M\log(x\log(2N/M)))$. The output is sent in
$O(M^{1/2})$ length packets on $A^{1/2}$ wires to a merging sorter that stores its output in
this same format and that sequentially merges $\log(2N/M)$ waves with its previous (and
initially empty) output. (Of course, if there were fewer waves of inputs, the sort would be
completed when all the inputs have been merged.) Each wave is merged with the previous output
in $\Theta(M^{1/2})$ time. This merging sorter also has area $A$. It outputs the merge of
$\log(2N/M)$ waves to a unit that recodes the data into the format used to represent $N$ numbers
in the range $[0, M-1]$ (the recoding method is discussed in appendix D). Now, the packets are
of length $(M\log(2N/M))^{1/2}$. The $M\log(2N/M)$ bits are sent to a similar (and final)
merging sorter that merges these group waves at the rate of one merger per
$M^{1/2}\log(2N/M)$ time. This sorter has area $A$. We deduce

Lemma 11: For $1 \le M \le N$, and $\max[\log N, M^{1/2}] \le T \le N/(M^{1/2}\log(2N/M))$,
there exist VLSI sorters with $AT = \Theta(NM^{1/2})$.
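
The wave-by-wave merging used in case 2 can be sketched as follows. This is a functional illustration only: it assumes each wave arrives already sorted, stores the running output as a plain list rather than in the encoded format, and uses hypothetical names (`merging_sorter`, `case2_parameters`); constants hidden by the $O$-notation are taken to be 1 and logarithms are base 2.

```python
import heapq
import math
from typing import Iterable, List, Tuple


def merging_sorter(waves: Iterable[List[int]]) -> List[int]:
    """Merge each sorted wave into the stored (initially empty) output in turn."""
    stored: List[int] = []
    for wave in waves:
        stored = list(heapq.merge(stored, wave))
    return stored


def case2_parameters(n: int, m: int, x: float) -> Tuple[float, float]:
    """Return (A, T) for the choice A = x M log(2N/M), T = N / (x M^(1/2) log(2N/M))."""
    lg = math.log2(2 * n / m)
    area = x * m * lg
    time = n / (x * math.sqrt(m) * lg)
    return area, time


if __name__ == "__main__":
    print(merging_sorter([[1, 4], [0, 2, 5], [3]]))
    area, time = case2_parameters(2 ** 20, 2 ** 10, 2)
    print("A*T =", area * time, " N*sqrt(M) =", 2 ** 20 * math.sqrt(2 ** 10))
```

The parameter $x$ cancels in the product, so every choice of $x$ in the admissible range lands on the same $AT = NM^{1/2}$ curve.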

6. Perimeter Sorters: $1 \le M \le N$

We start with the case $N^{1/2} \le M < N$, and then consider $\log N \le M < N^{1/2}$,
followed by $M < \log N$.

Case 1: $N^{1/2} \le M < N$. There are two subcases to consider: $T \le (M\log M)^{1/2}$, and
$(M\log M)^{1/2} \le T \le (M\log M)^{1/2}\log(2N/M)$.

Case 1.1: $T \le (M\log M)^{1/2}$. We use $N/M$ blocks, which each contain an optimal
perimeter sorter of $M$ inputs, and which run in time $\Theta(T)$. The outputs of the blocks
are combined by an aligned merge. Our merging requires the output from each perimeter sorter
to be coded in a form using $M\log(2N/M)$ bits (i.e. an encoding that is optimal for
representing $N$ numbers in $[0, M-1]$). Appendix D shows how to construct such an encoding
from the output of each block. We require that each block have sufficient size to store the
encoded data. Thus the area used by a block is $O((M^2\log M)/T^2 + M\log(2N/M))$. In
particular, a block has length $(M\log M)/T$, which is required to input the $M$ numbers, and
width $M/T + (T\log(2N/M))/\log M$. The length of all $N/M$ blocks (and hence of the circuit)
is therefore $O((N\log M)/T)$. Our encoding will result in the use of
$(T\log(2N/M))/\log M$ merging structures (and breakpoints), which will thus guarantee that the
block width is at least $(T\log(2N/M))/\log M$. There is still one difficulty: when
$N/M > T$, it will be impossible to use our bus merging structure to perform an aligned merge
of $N/M$ blocks in time $O(T)$. We introduce terrace trees (defined below), to replace the bus
in our aligned merge. A terrace tree has the following properties:

(i)  It  is  a  binary  tree. 

(ii)  The  depth  from  the  root  to  any  leaf  (block)  is  at  most  T  (edges). 

(iii)  Its  leaves  (blocks)  lie  on  a  straight  line. 

(iv)  The  physical  layout  height,  h,  of  the  tree  is  minimal. 

[Yao 1981] observed, essentially, that a $b$-leaf terrace tree must have a layout height of
$h = \Omega\!\left(\frac{\log b}{\log(2T/\log b)}\right)$, where $T \ge \log b$, and that this
bound can be achieved. A construction is given in Appendix F. The tree takes area $O(lh)$,
where $l$ is the length of the line on which the leaves lie. Merging (and broadcasting) on the
terrace tree proceeds just as merging (and broadcasting) on a merging bus, for locally the bus
and the terrace tree have essentially the same structure. We deduce

Lemma 12: For $N^{1/2} \le M \le N$, and $\log N \le T \le (M\log M)^{1/2}$, there exist VLSI
perimeter sorters with

$AT^2 = O\!\left(NM\log M\,\log(2N/M)\left[\frac{\log(2N/M)}{\log\bigl(2T/\log(2N/M)\bigr)} + 1\right]\right)$.

Proof: There are $N/M$ blocks, each taking area $O((M^2\log M)/T^2 + M\log(2N/M))$. We use an
aligned merge with $(T\log(2N/M))/\log M$ terrace trees, each of layout height (width)
$O\!\left(\frac{\log(2N/M)}{\log\bigl(2T/\log(2N/M)\bigr)} + 1\right)$, and of length
$O((N\log M)/T)$. □
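
To get a feel for the terrace tree height that enters Lemma 12, the bound attributed to [Yao 1981] can be evaluated directly. The sketch below assumes the form $h = \log b / \log(2T/\log b)$ with base-2 logarithms and the constant factor set to 1; the function name is hypothetical.

```python
import math


def terrace_height(b: int, t: float) -> float:
    """Evaluate log b / log(2T / log b) for a b-leaf terrace tree of depth at most t.

    The bound is only stated for t >= log b, so smaller depths are rejected.
    """
    if b < 2:
        raise ValueError("need at least two leaves")
    lg_b = math.log2(b)
    if t < lg_b:
        raise ValueError("depth bound T must be at least log b")
    return lg_b / math.log2(2 * t / lg_b)


if __name__ == "__main__":
    b = 1024
    for t in (10, 20, 40, 160, 640):
        print(f"T={t}: height ~ {terrace_height(b, t):.2f}")
```

At $T = \log b$ the height equals $\log b$, as for a complete binary tree with its leaves on a line, and it shrinks slowly as the permitted depth grows. This is why the bracketed factor in the lemma tends to a constant once $T$ is large relative to $\log(2N/M)$.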


Case 1.2: $(M\log M)^{1/2} \le T \le (M\log M)^{1/2}\log(2N/M)$. Let $T = x(M\log M)^{1/2}$,
where $1 \le x \le \log(2N/M)$. We use $N/(xM)$ blocks, each of which takes $x$ waves of $M$
inputs. Each block uses an optimal perimeter sorter running in time $O((M\log M)^{1/2})$ to
sort sets of $M$ inputs. Each block produces $x$ waves of outputs. Next, we create larger
units, called merging structures. Each structure will ultimately merge $x$ waves of outputs
from $x$ blocks, and the result will be output in encoded sorted order. Finally, an aligned
merge is used to combine the outputs of the structures. As before, the recoding for the aligned
merge is done in advance of the actual merge, and each structure needs $O(M\log(2N/M))$ area
for this step.

It remains to explain how to build the structures. They are obtained from $\log x$ levels of
type 6 merging networks. Recall that the merging networks are configured as a binary tree.
A node merges two waves of outputs from its two children. The layout length of a node, the
packet length, and the running time all double at successive levels of the tree. The area of a
node is proportional to the information content of its one wave of outputs. At the bottom
level we choose the type 6 merging networks to have $(M/\log M)^{1/2}$ input wires (and hence
width $O((M/\log M)^{1/2})$) and run in time $O((M\log M)^{1/2})$. Because $M < N$ in this
instance, the widths (and wire counts) form a decreasing (modified) geometric series, while the
running times double at each level of the network. We overlap the running of the various
levels of the type 6 merging networks, so that the overall running time is
$O(x(M\log M)^{1/2}) = O(T)$.

The total length of the $N/(Mx^2)$ structures is
$\frac{N}{Mx^2} \cdot x(M\log M)^{1/2} = \frac{N\log^{1/2} M}{M^{1/2}x} = (N\log M)/T$. Their
width, exclusive of the merging network, is $O((M/\log M)^{1/2})$. After accounting for the
aligned merging, we deduce

Lemma 13: For $N^{1/2} \le M \le N$, and $\log N \le T \le \log(2N/M)(M\log M)^{1/2}$, there
exist VLSI perimeter sorters with

$AT^2 = O\!\left(MN\log M\,\log(2N/M)\left[\frac{\log(2N/M)}{\log\bigl(2T/\log(2N/M)\bigr)} + 1\right]\right)$.

Proof: The aligned merge will use $O((M\log(2N/M))/T)$ terrace trees, each of layout height
$O\!\left(\frac{\log(2N/M)}{\log\bigl(2T/\log(2N/M)\bigr)} + 1\right)$. The length of the
network is $O((N\log M)/T)$, and the width of each structure is $O((M\log(2N/M))/T)$. The area
of each structure is $\Omega(M\log(2N/M))$, which ensures that the recoders do not consume
excessive area. The result follows. □

An AT tradeoff requires the same construction with only $\frac{N\log^{1/2}M}{TM^{1/2}\log(2N/M)}$ structures.

Each structure is pipelined further to process successive waves of $M\log(2N/M)$
inputs, in time $O((M\log M)^{1/2}\log(2N/M))$ per wave. (Each wave, in turn, is pipelined in
$\log(2N/M)$ subwaves, as described above.) Each structure has width $O((M/\log M)^{1/2})$ and length
$O((M\log M)^{1/2}\log(2N/M))$. For each structure, after each wave, we merge the outputs of the
current wave with the outputs from the previous waves, in time $O((M\log M)^{1/2}\log(2N/M))$,
using area $O(M\log(2N/M))$. Following all the waves of inputs, we combine the outputs of the
structures using an aligned merge in time $O(T)$. The $\frac{M}{T}\log(2N/M) + 1$ terrace trees used in the
aligned merge each have $O(1)$ height. This yields

Lemma 14: For $N^{1/2} \le M \le N$, and $\log(2N/M)\,(M\log M)^{1/2} \le T \le \frac{N\log^{1/2}M}{M^{1/2}\log(2N/M)}$, there exist
VLSI perimeter sorters with $AT = O(N(M\log M)^{1/2})$.

Case 2: $\log N \le M \le N^{1/2}$. We use $\min[N/(M\log M\log\log M),\ N/(T(M\log M)^{1/2})]$ blocks,
which each take $\max[1,\ T/((M\log M)^{1/2}\log\log M)]$ waves of $M\log M\log\log M$ inputs. Each block
sorts a wave of inputs using a perimeter sorter that is described by Lemma 13, and that runs
in time proportional to $\min[T,\ (M\log M)^{1/2}\log\log M]$. Next, the inputs are encoded in a count
representation: for each value we take $O(\log M)$ bits to record the number of instances of the
value in the inputs. In time $O((M\log M)^{1/2}\log N)$, each block receives at most
$M\log M\log N \le M^2\log M$ inputs. In each block, we combine the counts from all the waves of
inputs; this combining is carried out following each wave, and takes time
$O(\min[T,\ (M\log M)^{1/2}\log\log M])$, and thus time $O(T)$ over all the waves of input. Blocks have
length $l = O(\max[(M\log^2 M\log\log M)/T,\ \log M\,(M\log M)^{1/2}])$, and width
$O(\max[(M\log\log M)/T,\ (M/\log M)^{1/2}])$. Thus the area used by a block is $\Omega(M\log M)$, which is
sufficient for recoding the inputs in a count representation. We then combine the outputs of
$\log N/\log M$ of these blocks. We call the combined $\log N/\log M$ blocks a structure. Since each
of the blocks uses area $\Omega(M\log M)$, $\log N/\log M$ blocks use area $\Omega(M\log N)$, and have length
$\Omega((l\log N)/\log M)$. Thus there is sufficient area to use $\log N$ bits per count in each structure.
The combining of blocks into structures is detailed in the next paragraph. The $\log N$-bit
counts output by the structures are summed by $c = \max[1,\ (M\log N)/T]$ terrace trees. We
note that $(l\log N)/\log M \ge c$, so each structure is long enough to allow all $c$ terrace trees to be
attached to it. Each terrace tree carries $\min[M,\ T/\log N]$ counts, each count using $\log N$ bits.
It is easy to sum these counts systolically in $O(T)$ time.
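
The count representation used here can be illustrated in software. The following is a minimal sketch, not the circuit itself: the helper names `to_counts`, `combine_counts`, and `from_counts` are hypothetical, and values are assumed to lie in the range 0 to M-1. Each wave is recoded as one count per value, counts from successive waves are summed, and the sorted output is read off the summed counts.

```python
def to_counts(wave, M):
    """Count representation of one wave: counts[v] = occurrences of value v."""
    counts = [0] * M
    for v in wave:
        counts[v] += 1
    return counts

def combine_counts(a, b):
    """Sum two count vectors entry by entry (what the systolic summation computes)."""
    return [x + y for x, y in zip(a, b)]

def from_counts(counts):
    """Expand a count vector back into a sorted list of values."""
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)
    return out

M = 4
waves = [[3, 1, 3], [0, 1, 2]]
total = [0] * M
for wave in waves:
    total = combine_counts(total, to_counts(wave, M))
sorted_values = from_counts(total)
```

Because only M counts are kept regardless of how many waves arrive, the storage depends on the number range rather than on the number of inputs, which is why the blocks can absorb many waves.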

The purpose of a structure is to combine the counts for $\log N/\log M \le M/\log M$ blocks.
The combining is done using $\max[(M\log M)/T,\ (M/\log M)^{1/2}]$ channels of width $O(1)$, each of
which carries $\min[T/\log M,\ (M\log M)^{1/2}]$ counts. These counts are summed systolically. Since
each structure receives at most $M^3$ inputs, each count output by the structure will require no
more than $3\log M$ bits. The channels run in time $O(T)$. We note that
$l \ge \max[(M\log M)/T,\ (M/\log M)^{1/2}]$, so there is room to attach all the channels to each block.
We obtain

Lemma 15: For $\log N \le M \le N^{1/2}$, and $\log N \le T \le (M\log M)^{1/2}\log N$, there exist VLSI perimeter
sorters with $AT^2 = O\!\left(\frac{NM\log M\log^2 N}{\log(2T/\log N)}\right)$.

Proof: We use $\min[N/(M\log M\log\log M),\ N/(T(M\log M)^{1/2})]$ blocks; thus the sorter has length
$O((N\log M)/T)$. Each block has width $O(\max[(M\log\log M)/T,\ (M/\log M)^{1/2}])$. There are
$(M\log N)/T$ terrace trees, each having layout height
$O\!\left(\frac{\log(2N/M)}{\log(2T/\log(2N/M))}\right) = O\!\left(\frac{\log N}{\log(2T/\log N)}\right)$.
Each structure has $\max[(M\log M)/T,\ (M/\log M)^{1/2}] \le (M\log N)/T$ channels; thus
the channels use width $O((M\log N)/T)$ altogether. So the area used is
$O(NM\log M\log^2 N/(T^2\log(2T/\log N)))$. □

The same construction gives circuits for larger values of $T$. Let $T = x(M\log M)^{1/2}\log N$,
where $1 \le x \le \log N/\log M$. The input is read in $x$ waves, each wave taking
$O((M\log M)^{1/2}\log N)$ time. Our circuit uses the same structures as in Lemma 15 (for
time $= O((M\log M)^{1/2}\log N)$) to sort the $N/x$ inputs per wave, but the number of structures
(i.e. circuit length) is divided by $x$. There are $\max[1,\ M\log N/T]$ terrace trees, so the area is
scaled down by a factor of $x^2$. However, eventually the sorter width fails to decrease as $1/T$.
This occurs when the layout width of the terrace trees matches the width of the structures,
i.e. when $(M/\log M)^{1/2} = \Theta(\log N/\log(2T/\log N))\cdot((M\log N)/T)$, or when there is only one
terrace tree, i.e. when $M\log N/T = 1$. Solving for $T$ gives the requirement
$T \le \min[M\log N,\ (M/\log M)^{1/2}\log^2 N]$. We conclude

Lemma 16: For $\log N \le M \le N^{1/2}$, and $(M\log M)^{1/2}\log N < T \le \min[M\log N,\ (M/\log M)^{1/2}\log^2 N]$,
there exist VLSI perimeter sorters with $AT^2 = O\!\left(\frac{NM\log M\log^2 N}{\log(2T/\log N)}\right)$.

For yet larger times, we can, by pipelining further, reduce the number of structures
until just one remains, at which point the area is as small as possible. We obtain

Lemma 17: For $\log N \le M \le N^{1/2}$, and $\min[M\log N,\ (M/\log M)^{1/2}\log^2 N] \le T \le \frac{N(M\log M)^{1/2}}{M\log N}$,
there exist VLSI perimeter sorters with $AT = O\!\left(N(M\log M)^{1/2} + \frac{N\log N\log M}{\log(2T/\log N)}\right)$.

Case 3, $M < \log N$. Our constructions for very small $M$ are similar but more complex.
We start with $\frac{N\log M\,(M\log M)^{1/2}}{TM^2}$ blocks that take $T/(\log M\,(M\log M)^{1/2})$ waves of $M^2$ inputs.
The blocks are perimeter sorters, as described in Lemma 13, that sort each wave of $M^2$ inputs
in time proportional to $\min[T,\ \log M\,(M\log M)^{1/2}] = \log M\,(M\log M)^{1/2}$ and provide the output
in count representation. They have length $O(M(M/\log M)^{1/2})$, and width $O((M/\log M)^{1/2})$.
We create structures, each of which combines all the waves of output produced by
$(\log M\log N)/M$ blocks. We describe how to build the structures in the next paragraph. We
complete the sort by combining the outputs of the structures using terrace trees, as above.

A structure is built as follows. The outputs of the blocks are combined using
$(M/\log M)^{1/2}$ column counters, described in appendix E. A column counter has width $O(1)$.
In each structure, the $i$th counter is pipelined to accumulate, over all waves, the number of
instances of the values $i(M\log M)^{1/2},\ i(M\log M)^{1/2}+1,\ \ldots,\ (i+1)(M\log M)^{1/2}-1$ that occur among
the inputs to the structure's blocks, $0 \le i < (M/\log M)^{1/2}$. A counter has one input port per
block, and inputs $(M\log M)^{1/2}$ $\log M$-bit counts per wave for each block, one count per value
being accumulated. Column counters can input, at one port, one $\log M$-bit number in every
$\log M$ time steps. The blocks, during each wave, therefore have sufficient time to transmit
their outputs to their connecting counters. The accumulation of data is distributed, so that
the total time needed to compute the global counts of each counter's input data equals the
time to input all the data plus a delay proportional to the memory capacity of
$\Theta(\log M\,(M\log M)^{1/2}\cdot(\log M\log N)/M)$ bits. Upon accounting for the requisite stage of terrace
trees, we obtain

Lemma 18: For $1 \le M \le \log N$, and $\log N \le T \le M\log N$, there exist VLSI perimeter sorters
with $AT^2 = O\!\left(\frac{NM\log M\log^2 N}{\log(2T/\log N)}\right)$.

Proof: We use $\frac{N\log M\,(M\log M)^{1/2}}{TM^2}$ blocks; so the length of the sorter is $O((N\log M)/T)$. The
width of the blocks is $O((M/\log M)^{1/2})$, and there are $(M/\log M)^{1/2}$ column counters each of
width $O(1)$. We use $\max[(M\log N)/T,\ 1]$ terrace trees, each having layout height
$O\!\left(\frac{\log N}{\log(2T/\log N)}\right)$. Thus the width of the sorter is
$O\!\left((M/\log M)^{1/2} + \frac{M\log^2 N}{T\log(2T/\log N)} + \frac{\log N}{\log(2T/\log N)}\right)$. □
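
The way column counters partition the value range can be sketched in a few lines. This is an illustration only: the function name `column_counters` and the parameters `span` (the number of consecutive values handled by one counter, $(M\log M)^{1/2}$ in the text) and `num_counters` are hypothetical names, and the pipelined, distributed accumulation of appendix E is not modelled.

```python
def column_counters(inputs, span, num_counters):
    """counters[i][k] = number of inputs equal to i*span + k.

    Counter i is responsible for the values i*span, ..., (i+1)*span - 1.
    """
    counters = [[0] * span for _ in range(num_counters)]
    for v in inputs:
        i, k = divmod(v, span)  # which counter, and which slot within it
        counters[i][k] += 1
    return counters

counters = column_counters([0, 5, 5, 11, 4], span=4, num_counters=3)
```

Reading the counters in order of increasing `i` and `k` yields the inputs in sorted order, in count representation.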

The  same  proof  shows 

Lemma 19: For $1 \le M \le \log N$, and $M\log N \le T \le \frac{N(M\log M)^{1/2}}{M\log N}$, there exist VLSI perimeter
sorters with $AT = O\!\left(N(M\log M)^{1/2} + \frac{N\log N\log M}{\log(2T/\log N)}\right)$.

7.   Dense Sorters: $N^2 \le M$

We will be considering sorters that, for large $M$, run in time much smaller than $\log M$,
so we need to specify how the numbers are input: they are input in sets of $2\log N$ contiguous
bits. The overall scheme is to sort corresponding pieces of each of the $N$ numbers locally,
and to communicate globally as little information as possible, namely the ranks of the inputs.
The idea for this parsimonious organization originates in [Leighton 1984]. By introducing
some new merging organizations when necessary, and by careful tuning in general, we show
how to design such sorting circuits optimally, thereby improving results given in [Leighton
1984], and extending the ranges of area-time tradeoffs for $M$ and $T$. Interestingly, while we
obtain an $AT^2$ tradeoff for sorting, for large enough values of $M$ we do not obtain solely an
$AT$ tradeoff; in addition there is an $AT/\log A$ tradeoff (this is due to [Bilardi and Preparata,
1985b]). However, the $AT$ and $AT^2$ tradeoffs do include all possible times for large enough
$M$ for two related problems: ranking, and sorting in the case that the inputs are read twice.

We will use the following sorting circuit as a basic subunit. It inputs the names of $N$
numbers (each expressed in $\log N$ bits), and also inputs the same $2\log N$ consecutive bit
variables for each of the numbers, and outputs the name of each number, followed by its
local rank. (The rank of a number is the count of numbers it exceeds, and a local rank is the
rank of an input based solely on the $2\log N$ input bits for each of the $N$ numbers read into the
sorter. Equal numbers, quite clearly, get equal ranks.) A trivial modification of a circuit
switching sorter gives such a device. It sorts the $N$ numbers based on their $2\log N$ bits. Next,
a binary tree located across the end of the sorter computes, for each number, its rank, which
is just the position of the first number having the same value. The $2\log N$ data bits are then
replaced by the $\log N$-bit ranks. It is convenient to sort the numbers by name, so a second
sorting is performed, using the name as the key. The sorter performs an order preserving
compression, taking $2\log N$ bits of the input down to $\log N$ bits. (It also leaves the numbers
ordered according to their place of origin.) We call the work done by this device a
compression sort.
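
The behaviour of a compression sort can be sketched in software as follows. This is a functional illustration under stated assumptions, not a model of the circuit: the function name `compression_sort` and the representation of records as `(name, key)` pairs are hypothetical, and Python's built-in sort stands in for the circuit switching sorter.

```python
def compression_sort(records):
    """records: list of (name, key) pairs with distinct names.

    Returns (name, rank) pairs ordered by name, where rank is the number of
    keys strictly smaller than the record's key (equal keys share a rank).
    """
    by_key = sorted(records, key=lambda r: r[1])   # first sort: by the data bits
    ranked = []
    for pos, (name, key) in enumerate(by_key):
        if pos > 0 and key == by_key[pos - 1][1]:
            rank = ranked[-1][1]   # same value: reuse the rank of the first copy
        else:
            rank = pos             # position of the first record with this value
        ranked.append((name, rank))
    return sorted(ranked)          # second sort: by name

result = compression_sort([(0, 7), (1, 3), (2, 7), (3, 5)])
```

Since at most N distinct ranks occur, each rank fits in $\log N$ bits, and comparing ranks gives the same outcome as comparing the original keys.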

Let $g = \log\!\left(\frac{\log M}{\log N}\right)$. To sort, when $g\log N \le T \le (N\log N)^{1/2}$, we will use a merging
circuit organized as a binary tree of $\log M/\log N$ leaves, where each node contains a
compression sorter. Let the leaves be numbered consecutively according to a depth first
traversal of the tree. The $i$th leaf receives as input the $i$th set of $2\log N$ consecutive bits
belonging to each of the input variables. Each leaf stores its original (unsorted) input data,
generates variable names for the data, and applies a compression sort to the input records.
The input to each internal node consists of the output from its two children, where for each
number (by name) we concatenate the ranks awarded by the children; this concatenation is
simple, since the numbers are always output ordered by name.

The  true  rank  of  each  of  the  N  numbers  will  be  computed  at  the  root  of  the  tree.  Once 
these  rankings  are  computed,  it  is  a  simple  matter  to  broadcast  them  to  each  of  the  leaves, 
whence  a  final  sorting  of  the  original  data  based  on  the  ranks  produces  the  desired  sort. 
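
The rank computation in the tree can be simulated directly. The sketch below is an illustration with assumed names (`local_ranks`, `tree_rank`) and an assumed convention that leaf 0 holds the most significant chunk; list positions play the role of names. Each leaf compresses one chunk of every number, and each internal node compresses the concatenation of its children's ranks.

```python
def local_ranks(keys):
    """Order-preserving compression: each key -> number of strictly smaller keys."""
    first_pos = {}
    for pos, k in enumerate(sorted(keys)):
        first_pos.setdefault(k, pos)
    return [first_pos[k] for k in keys]

def tree_rank(numbers, chunk_bits, num_chunks):
    """Rank numbers of chunk_bits * num_chunks bits via a binary tree of compressions."""
    mask = (1 << chunk_bits) - 1
    # Leaf i sees the i-th chunk of every number, most significant chunk first.
    level = [
        local_ranks([(x >> (chunk_bits * (num_chunks - 1 - i))) & mask for x in numbers])
        for i in range(num_chunks)
    ]
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level) - 1, 2):
            # Concatenate the two children's ranks per number, then compress again.
            nxt.append(local_ranks(list(zip(level[j], level[j + 1]))))
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node is passed up unchanged
        level = nxt
    return level[0]

numbers = [0b101101, 0b000111, 0b101101, 0b111000, 0b000110]
ranks = tree_rank(numbers, chunk_bits=2, num_chunks=3)
```

The result agrees with ranking the full numbers directly, because each compression preserves order and concatenated ranks compare lexicographically.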

The tree is laid out as an H-tree, but blocks at different levels are tuned so that they
may run at different speeds. The leaves use compression sorters running in time $O(T)$. Each
time we go up four levels in the (binary) tree, the running time of the corresponding blocks
is halved, until $4\log g$ levels have been traversed. This results in doubling the side length of
the blocks as we go up every four levels in the tree. It is now easy to show:


Lemma 20: For $M \ge N^2$, for $\log N\,\log\!\left(\frac{\log M}{\log N}\right) \le T \le (N\log N)^{1/2}$, there exist dense VLSI
sorters with $AT^2 = \Theta(N^2\log N\log M)$.

Proof: Each block that is a leaf takes area $O\!\left(\frac{N^2\log^2 N}{T^2}\right)$, and thus has side length
$O\!\left(\frac{N\log N}{T}\right)$ (and runs in $O(T)$ time). Since the side length of the blocks only doubles every
fourth level in the tree we deduce that the side length of the sorter is $O\!\left(l^{1/2}\,\frac{N\log N}{T}\right)$, where $l$
is the number of leaves in the tree. There are $\frac{\log M}{\log N}$ leaves in the tree, so the area used by
the H-tree is $O(N^2\log N\log M/T^2)$. Since the time to perform the action of a block (essentially
sorting) halves every fourth level in the tree, for the first $4\log g$ levels, the time used by these
levels is $O(T)$. The remaining levels each run in time $O(T/g)$, and there are $O(g)$ of them;
thus they use a further $O(T)$ time. The result follows. □
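
The time accounting in this proof can be checked numerically. The sketch below is a simplified model with assumed names: `total_time` sums one block running time per level of a tree of depth `g`, with the running time halving every four levels for the first $4\log_2 g$ levels and equal to `T / g` afterwards.

```python
import math

def total_time(T, g):
    """Sum of per-level running times for a tree of depth g (g a power of two)."""
    halving_levels = min(g, 4 * int(math.log2(g)))
    # Levels 0-3 run in time T, levels 4-7 in time T/2, and so on.
    times = [T / 2 ** (level // 4) for level in range(halving_levels)]
    # After 4*log2(g) levels the running time has dropped to T/g and stays there.
    times += [T / g] * (g - halving_levels)
    return sum(times)

t = total_time(1.0, 64)
```

For `g = 64` the first 24 levels contribute less than `8 * T` (a geometric series) and the remaining 40 levels contribute less than `T`, so the total is a constant multiple of `T`.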

We remark that we might expect to build $AT^2$-optimal sorting circuits using as little as
$O(\log\log M)$ time. And in fact, we can obtain optimal results for smaller values of $T$. We
give optimal results for $\log N\cdot\min[\log\log N,\ \log(2\log\frac{\log M}{\log N})] + \log\log M \le T \le (N\log N)^{1/2}$. By
using a slightly different input schedule, one can also obtain results for the remaining smaller
values of $T$ (details are left to the interested reader); these results are not known to be
optimal.

The key to our constructions is another sorting network, which sorts $N$ $r\log N$-bit
numbers, $2 \le r \le N$, in time $O(\log N)$. It is a simple task to modify the mesh of trees (see
[Leighton 1984]) to have paths of width $r$, and to replace the comparators at each grid point
by trees of height $\log r$. This will give a circuit taking area $O(N^2r^2\log^2 N)$, that sorts in time
$O(\log N)$. We then plug this sorter into the Columnsorter construction of [Leighton 1984];
the only further change is in the circuitry for performing a transpose: each wire has to be
replaced by a bus of width $r$. This leads to a sorter using area $O(N^2r^2)$ and running in time
$O(\log N)$.

Our constructions are similar to the ones above. Again, our sorter can be thought of as
a tree of blocks, laid out (essentially) as an H-tree. Let $T_b$ be the running time of the blocks
comprising the leaves. Each time we go up four levels in the tree we halve the running time
of the corresponding blocks, until the blocks have running time $O(\log N)$. We now rapidly
increase the degree of higher level nodes in the tree. Specifically, let the compression sorters
at four consecutive levels of the tree input $r\log N$-bit numbers, where $r < N$. Then the next
four higher levels will use sorters inputting $r^2\log N$-bit numbers. This degree increase stabilizes
when $r$ reaches $N$. It is not difficult to see that the basic H-tree layout organization
suffices, since each block has only $N\log N$ bits of output; there is room to provide inputs of
$r^2\log N$ bits per number to the blocks at the next level. The change in sorter size at every
fourth level ensures that the area used by the higher levels of the tree is negligible.

Lemma 21: For $M \ge N^2$, and $\log N\,\log(2\log\frac{\log M}{\log N}) + \log\log M \le T \le (N\log N)^{1/2}$, there exist
dense VLSI sorters with $AT^2 = \Theta(N^2\log N\log M)$.

Proof: Let the leaf blocks run in time $T_b$. As in the proof of Lemma 20, the area used is
$O(N^2\log N\log M/T_b^2)$. The time used by the sorters, up to the level of the first sorter running
in time $O(\log N)$, is $O(T_b)$. The time used by each of the remaining levels of the tree is
$O(\log N)$. There are $O(\min[\log\log N,\ \log\log\frac{\log M}{\log N}])$ levels to increase $r$ from $r = 2$ to
$r = \min[N,\ \frac{\log M}{\log N}]$. There are another $O(\log_N\frac{\log M}{\log N}) = O(\frac{\log\log M}{\log N})$ levels to reach the root
of the tree. Thus the total running time is
$O(T_b + \log N\cdot\min[\log\log N,\ \log\log\frac{\log M}{\log N}] + \log\log M)$. □

For $T \ge (N\log N)^{1/2}$, we can obtain $AT$ tradeoffs as usual by pipelining the data through a
smaller $AT^2$ optimal sorter. In Lemma 21, we have constructed trees that run in
$\Theta((N\log N)^{1/2})$ time, have as many as $2^{\Theta((N\log N)^{1/2})}$ leaves, and that have an area proportional
to the number of input ports. Thus in one $(N\log N)^{1/2}$ timestep we can sort as many as
$2^{\Theta((N\log N)^{1/2})}$ contiguous bits of each number. It is easy to see that

Lemma  22:    For  maxf     ^^°?<^      ,(N\ogNy^]:ST:s  (NlogNV^^^-^^,  there  exist  dense  VLSI 

sorters with $AT = \Theta(N\log M\,(N\log N)^{1/2})$.

Bilardi  and  Preparata  have  shown  that  there  is  another  area-time  tradeoff,  and  that  the 
tradeoff  is  tight.  For  completeness,  we  state  the  results  and  observe  that  our  constructions  are 
also adequate to establish the existence of such sorting circuits.

Lemma 23: For $2^{\Theta((N\log N)^{1/2})} < A < N\log M$, there exist dense VLSI sorters with
$\frac{AT}{\log A} = \Theta(N\log M)$.

Proof: Use a tree of $A$ inputs, as constructed for Lemma 21, where the leaves are sorters
running in $\Theta((N\log N)^{1/2})$ time. Then the ranks are computed in $\log A$ time. The tree can be
pipelined at that rate for a total time proportional to $T = (\log A)(N\log M)/A$. □

Combining Lemmas 22 and 23 gives:

For $\max[\log\log M,\ (N\log N)^{1/2}] \le T \le (N\log N)^{1/2}\,\frac{\log M}{\log N}$, there exist dense VLSI sorting circuits
with $AT = O(N\log M\,((N\log N)^{1/2} + \log A))$.

Interestingly, it is possible to extend the Bilardi-Preparata $AT$ lower bound to the
problem of computing ranks. In this case, it still turns out that for $A < N\log M$, and $M > N^2$,
$AT = \Omega(N\log M\,(N\log N)^{1/2})$. By pipelining our tree constructions solely at the leaves, we
deduce

Lemma 24: For $\max[\log\log M,\ (N\log N)^{1/2}] \le T \le (N\log N)^{1/2}\,\frac{\log M}{\log N}$, there exist dense VLSI
ranking circuits with $AT = O(N\log M\,(N\log N)^{1/2})$.

Thus for $T \ge \log N\cdot\min[\log\log N,\ \log\log\frac{\log M}{\log N}] + \log\log M$ these results provide optimal
dense VLSI sorting and ranking circuits. Furthermore, if the input data can be read twice (or
any finite number of times exceeding one), then the $AT$ tradeoff for ranking circuits becomes
a tight tradeoff for sorting (we read the inputs once, compute ranks, and broadcast the ranks
to the leaves; then we read the inputs a second time and sort them locally based on their
ranks, which gives the desired sort).

8.   Perimeter Sorters: $N^2 \le M$

As for the dense sorters with $M \ge N^2$, input bits for each of the $N$ numbers are read in
sets of $2\log N$ contiguous bits per block. Again, we obtain an $AT^2$ tradeoff for sorting.
However, there is no $AT$ tradeoff for sorting; the $AT$ tradeoff applies to ranking (and hence
sorting, when the inputs can be read twice). Instead, we obtain two other tradeoffs for
sorting; they are discussed below. Unfortunately, we are not able to show that most of these
bounds are fully optimal (they diverge from the lower bound when $M$ is superpolynomial in
$N$). The fully optimal bounds are the $AT$ tradeoff for ranking, and an $AT^2$ tradeoff for a
sorting verifier. (A sorting verifier inputs $N$ numbers and $N$ ranks in a when- and where-
determinate way and indicates whether the rankings are correct, i.e., whether variable $X_i$ has
rank $r_i$, $i = 1,2,\ldots,N$.)

Let $g = \log\!\left(\frac{\log M}{\log N}\right)$. To sort, when $g\log N \le T \le g(N\log N)^{1/2}$, we use a circuit consisting
of a binary tree with $\log M/\log N$ leaves. A compression sorter running in time $O(T/g)$ is
located in each node of the tree, and the ranking is computed as in section 7. Thus the rank
of each of the $N$ numbers is computed at the root, and broadcast to each of the leaves,
whence a final sorting of the data (if available) based on the ranks produces the desired sort.

In fact, the size of the tree can be compressed a little. We use only $\frac{\log M}{g\log N}$ leaves.
Each leaf has $g$ phases; in each phase a further $\log N$ bits for each number are read. The
current $\log N$ bits (those just read in) are appended to the ranks already computed and new
ranks are obtained using a compression sort. The binary tree is laid out as if it were a
grounded terrace tree of degree $T/g$ (see appendix F).
A grounded terrace tree has essentially the same structure as a terrace tree except that
the processing elements that would belong to each internal node are located in nodes laid out
along the same line as the leaves. The internal nodes are just switches that can route data to
the processor and back, or can bypass a processor by sending data to the next internal node
in the tree. The processors are, essentially, attached to their corresponding switch nodes by
edges. Appendix F gives more detail about these structures. A degree $d$ terrace tree is, with
one modification, essentially a complete $d$-ary tree in layout, with each node of a given level
at the same height in the layout. The modification replaces each $d$-ary node by a chain of
$d-1$ binary nodes at the same height above the leaves.
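
The effect of the degree $d$ on layout height can be illustrated with a small calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the rounding with `ceil` and the additive 1 for the row of leaves are modelling choices rather than details taken from appendix F. A complete $d$-ary tree over a binary tree of depth `binary_depth` has about `binary_depth / log2(d)` levels, and each $d$-ary node is realized by a chain of $d-1$ binary nodes at one height.

```python
import math

def terrace_layout_height(binary_depth, d):
    """Number of layout rows for a degree-d terrace tree over 2**binary_depth leaves."""
    return math.ceil(binary_depth / math.log2(d)) + 1

def chain_length(d):
    """Binary nodes laid side by side to realize one d-ary node."""
    return d - 1

height = terrace_layout_height(binary_depth=12, d=8)
```

Raising the degree trades layout height for longer chains at each height: with `d = 8` a binary depth of 12 needs 5 rows, while `d = 2` would need 13.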

The terrace tree has a layout height $\frac{g}{\log(T/g)} + 1$. The value $d = T/g$ is chosen because
it is (proportional to) the time in which neighboring P blocks must be able to communicate
(see appendix F). Complete binary subtrees of $d$ processors are simulated by a chain of $d$
switches at a given layout height, plus processors, so that the communication time between a
processor and its parent (with respect to the subtree) is $O(d)$ in the grounded terrace tree.
This time penalty for communication is rendered harmless because the processing time is
$\Theta(d)$, so that the overall running time is at most doubled. The grounding of processors allows
the bus structure to have an $AT^2$ tradeoff (in Lemma 27) even in instances when the number
of terrace trees is less than $(N\log N)^{1/2}$. This is because the height of the sorting structure
is the height of the terrace trees multiplied by the number of such trees plus the height of one
processor. We deduce


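Lemma 25: For $N^2 \le M$, $A \ge N\log M$, and $g\log N \le T \le g(N\log N)^{1/2}$, where $g = \log\!\left(\frac{\log M}{\log N}\right)$,
there exist VLSI perimeter sorters with $AT^2 = O\!\left(N^2\log N\log M\; g\left(\frac{g}{\log(T/g)} + 1\right)\right)$.
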

Proof: The bound $A \ge N\log M$ is present to ensure we can store all the inputs while the ranks
are being computed.

A block takes area $O(g^2N^2\log^2 N/T^2)$, so has side length $O(gN\log N/T)$. As there are
$\frac{\log M}{g\log N}$ leaf blocks, the length of the circuit is $O((N\log M)/T)$. The width is the tree layout
height times the number of trees (i.e. bus wires) plus the side length of a block, so the width
is $O\!\left(\frac{gN\log N}{T}\left(\frac{g}{\log(T/g)} + 1\right)\right)$. The result now follows. □


It is worth noting that the area bound can be readily absorbed into the bound for $T$:
$g\log N \le T \le (g^{1/2} + g/\log^{1/2}N)(N\log N)^{1/2}$.

As in the dense case, perimeter sorters using area $A < N\log M$ can also be designed.
That is, we pipeline into a sorter that sorts $A$ bits in one phase. We obtain the following
results.

Lemma 26: For $N\log N \le A \le N\log M$, there exist VLSI sorting circuits with
$AT = O\!\left(N^{3/2}\log M\left(\log\frac{A}{N\log N} + \log^{1/2}N\,\log^{1/2}\frac{A}{N\log N}\right)\right)$.

Proof: In area $A$ the sorter described in Lemma 25 sorts $A$ bits in time proportional to
$T' = g'N^{1/2} + (g'N\log N)^{1/2}$, where $g' = \log\frac{A}{N\log N}$. By pipelining the inputs into this sorter
and recomputing partial ranks every $T'$ steps, the whole sorter will run in time proportional
to $T = \frac{N\log M}{A}\,T'$, whence the result follows.
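
The arithmetic behind this bound can be checked numerically. The sketch below simply evaluates the expressions from the proof with all hidden constants set to 1, which is an assumption made for illustration; the function name `pipelined_sort_time` is hypothetical, and logarithms are taken base 2.

```python
import math

def pipelined_sort_time(N, M, A):
    """T = (N log M / A) * T', with T' = g' N^(1/2) + (g' N log N)^(1/2)."""
    log_n, log_m = math.log2(N), math.log2(M)
    g1 = math.log2(A / (N * log_n))                            # g' in the proof
    t_phase = g1 * math.sqrt(N) + math.sqrt(g1 * N * log_n)    # T'
    return (N * log_m / A) * t_phase                           # N log M / A phases

N, M = 2 ** 10, 2 ** 80
A = 4 * N * 10          # four times N log N, so g' = 2
T = pipelined_sort_time(N, M, A)
```

Multiplying through, `A * T` equals $N\log M\cdot T'$, which is the $N^{3/2}\log M\,(g' + (g'\log N)^{1/2})$ expression in the lemma.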

It is also possible to obtain ranking circuits that use less area than the sorting circuits
(for some values of $T$). For $T \le g(N\log N)^{1/2}$ the construction of Lemma 25 suffices without
the restriction that $A \ge N\log M$. Moreover, the construction applies to yet larger values of $T$;
the only changes are that we pipeline further (at the leaves), and while we cannot slow down
the compression sorters (to run slower than $(N\log N)^{1/2}$), we slow down the communication
between them. Namely, we increase the packet length to reduce the number of terrace trees
(roughly as $1/T$) until the layout height of the tree family is $(N\log N)^{1/2}$. Further pipelining
only decreases area in one dimension. It is easy to deduce

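Lemma 27: Let $f = \left(g + \frac{g^2}{\log g + \log N}\right)(N\log N)^{1/2}$. For $N^2 \le M$, for $g\log N \le T \le f$ there exist
VLSI ranking circuits with $AT^2 = O\!\left(N^2\log N\log M\; g\left(\frac{g}{\log(T/g)} + 1\right)\right)$, while for
$f \le T \le (N\log N)^{1/2}\,\frac{\log M}{\log N}$ there exist VLSI ranking circuits with $AT = O(N\log M\,(N\log N)^{1/2})$.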

The  limit  case  arises  when  we  have  just  one  block. 

It is interesting to note that we can build verifier circuits for which
$AT^2 = O\!\left(N^2\log N\log M\left(\frac{g}{\log(T/g)} + 1\right)\right)$, for $\log N + g \le T \le (N\log N)^{1/2}$. In a verifier circuit, a
number's rank as well as its value is input, and it outputs a single bit that indicates if the
ranks are correct. We use much the same structure, providing the inputs in sets of $\log N$ bits.

The binary tree will have $\frac{\log M}{\log N}$ leaves, and each leaf will contain a compression sorter.
First, for each number, we broadcast its final rank to every leaf using $(N\log N)/T$ terrace
trees. Second, in each leaf, we sort the numbers according to their final rank, in time $O(T)$.
Third, for each number we check whether it is larger than, equal to, or less than its predecessor,
based on the bits available locally. The result takes two bits to store. We lay out the leaves
as the leaves of a terrace tree, with path length $T$. For each node in the tree, for each
number, we wish to compute whether the number is larger than, equal to, or smaller than its
predecessor, based on the bits input to the subtree rooted at that node. But for all nodes,
except the leaf nodes, this is easily computed in $O(1)$ time per number. For each number this
result is calculated by combining the results obtained by the two children of the node. At the
root of the tree we are able to check that the input final order is indeed correct, i.e. we check
every number is $\ge$ its predecessor.
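
The constant-time combining step at the internal nodes can be sketched as follows. This is an illustration under assumptions: the names `combine`, `leaf_compare`, `chunks`, and `is_sorted_by_tree` are hypothetical, leaf 0 is assumed to hold the most significant chunk, and a left-to-right fold stands in for the tree, which is legitimate because the combining rule is associative.

```python
from functools import reduce

LESS, EQUAL, GREATER = -1, 0, 1

def combine(high, low):
    """Result for concatenated bits: the high-order part decides unless it ties."""
    return high if high != EQUAL else low

def leaf_compare(a, b):
    """Three-way comparison of one chunk of a number with that of its predecessor."""
    return (a > b) - (a < b)

def chunks(x, chunk_bits, num_chunks):
    mask = (1 << chunk_bits) - 1
    return [(x >> (chunk_bits * (num_chunks - 1 - i))) & mask for i in range(num_chunks)]

def is_sorted_by_tree(seq, chunk_bits, num_chunks):
    """seq lists the numbers in order of their claimed ranks."""
    for prev, cur in zip(seq, seq[1:]):
        leaf_results = [leaf_compare(c, p) for c, p in
                        zip(chunks(cur, chunk_bits, num_chunks),
                            chunks(prev, chunk_bits, num_chunks))]
        if reduce(combine, leaf_results) == LESS:
            return False   # some number is smaller than its predecessor
    return True

ok = is_sorted_by_tree([6, 7, 45, 45, 56], chunk_bits=2, num_chunks=3)
bad = is_sorted_by_tree([6, 45, 7], chunk_bits=2, num_chunks=3)
```

Each node only forwards a two-bit result per number, which is why the internal nodes need $O(1)$ time per number.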

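The length of the sorter is $O\!\left(\frac{N\log N}{T}\cdot\frac{\log M}{\log N}\right) = O\!\left(\frac{N\log M}{T}\right)$. The height of the terrace
trees is $O\!\left(\frac{g}{\log(T/g)} + 1\right)$, and there are $(N\log N)/T$ of them; the width of the blocks is
$O((N\log N)/T)$. So the width of the sorter is $O\!\left(\frac{N\log N}{T}\left(\frac{g}{\log(T/g)} + 1\right)\right)$. Thus we have
$AT^2 = \Theta\!\left(N^2\log N\log M\left(\frac{g}{\log(T/g)} + 1\right)\right)$.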

This result is of interest because it suggests that we cannot obtain a better lower bound
for ranking large numbers with perimeter sorters, based solely on information flow
arguments. For such arguments also apply to the verification problem, but we have given a
construction showing the verification lower bound is tight (up to a constant factor).
Furthermore, for $T \le (N\log N + Ng)^{1/2}$, the $AT^2$ complexities for ranking and sorting are
presumably equal, so the verifier results also suggest that information flow arguments are
inadequate for improving the lower bound for the sorting problem, in this time range. We
also note that the $AT^2$ complexity for ranking is provably at least as large as the $AT^2$
complexity for read twice sorting.


9.   Conclusions  and  Remarks

This  paper  presents  a  large  variety  of  optimal  (and  a  few  nearly  optimal)  constructions 
for  VLSI  sorters,  thereby  establishing  the  tightness  of  several  area-time  tradeoffs  for  most 
relevant  number  ranges  and  times.  Circuits  are  given  for  all  M  other  than  the  number  range 
M  =  N^^'^^\  which  is  covered  by  the  earlier  works  of  [Bilardi  and  Preparata  1984]  and
[Leighton  1984].  For  simplicity,  we  describe  (approximately)  the  M-T  zones  for  which 
optimality  is  still  unknown.  Dense  sorters:  M~N,  \ogN^T<\ogN\og''N;  and  \ogM~N, 
loglogAf  sr<  log^loglogiV.  Perimeter  sorters:  log^  =  o(logAf),  most  times.  We  also  remark 
that  for  the  perimeter  sorters  which  contain  terrace  trees  that  consume  the  principal  portion 
of  the  circuit  area,  the  relevant  lower  bounds  (an  application  of  [Yao,  1981])  presently 
assume  a  form  of  conservative  flow  in  the  merging  circuits  corresponding  to  the  terrace  trees; 
thus  these  lower  bounds  are  not  fully  general. 

In  the  process  of  designing  these  circuits,  we  have  developed  several  new  merging 
networks  and  organizations,  of  which  the  aligned  merge  is,  perhaps,  the  most  fundamental. 
We  have  also  shown  new  and  critical  applications  for  funneled  pipelining  techniques,  and  for 
data  compression  as  well.  It  turns  out  that  our  constructions  are  typically  hybrid,  and  have 
three  parts,  as  was  discussed  in  the  introduction. 

Our  constructions  give  rise  to  a  number  of  corollaries,  including  some  that  are 
somewhat  subtle.  These  corollaries  will  be  discussed  more  fully  in  a  later  paper;  we  just  state 
the  results  here.  We  consider  three  functions  related  to  sorting:  ranking,  packet  switching 
crossbars,  and  parallel  read  only  memory.  Each  involves  a  form  of  distributed 
communication. 

The ranking problem is to read $N$ input variables, $X_1, X_2, \ldots, X_N$, where $X_i \in [0,M]$, and to
output $Y_1, Y_2, \ldots, Y_N$, where $Y_i$ is the rank of $X_i$. A packet switching crossbar functions as a
variant of a packet router, and also as a variant of a scatter-gather operation: the crossbar
reads $N$ input indices, $X_1, X_2, \ldots, X_N$, where $X_i \in [0,N-1]$, and $N$ input variables, $W_1, W_2, \ldots, W_N$,
where $W_i \in [0,M]$, and outputs $Y_1, Y_2, \ldots, Y_N$, where $Y_i = W_{X_i}$. The parallel read only memory
problem is almost the same as the crossbar except that the data $W_i$ are preprocessed (i.e.,
read in advance, or built into the machine).

We state the results for some of these problems in the ranges of principal interest, and
remark that all bounds given below are tight.

Lemma  28:   For  N^M:SN^,  and  \ogN^iN,M/N)^T^(N/logNy^log2M/N ,  There  exist  VLSI 
ranking  circuits  with  Ar^  = /^^log^-=^. 

Lemma  29:    For  IsA/s//,  and  logN^(N ,M)^T:s(N/\ogNy^log2M ,  There  exist  VLSI  N- 
way  packet  switching  crossbars  for  logAf-bit  words  with  AT^  =  Nhog^2M. 

Lemma  30:   For  l^M^N,  and 

(^/log^)i'^log2A/^r:Smin[(— ^^)'^log^^^^,  (NlogMy^], 

logM  logAf 

there      exist     VLSI     N-way     packet     switching     crossbars     for     logA/-bit     words     with 

AT^  =  N^log^lM  +  log^^^Q^^). 

A 

Lemma    31:     For    1:SM:S//,    and    logArp(Ar,Af)srs  (AriogAf)'^max[logAf  ,log^^^]/log//, 

logAf 

there  exist  TV-way  parallel  ROM  circuits  with  Ar  =  Ar3^1ogi^A/maxriogA/, log ] 

NlogM 

Lemma  32:    For  ISAf  sAf,  and 
(AriogAf)'^max[logAf,log7^^]/log;V:srslogAf(N/logA/)''^'

iogM 

there  exist  //-way  parallel  ROM  circuits  with  AT  =  N^^log^^M  (logAf  +  log    °^     ) . 

IogM 

We find it quite surprising that these problems should have the same area-time complexity¹
as sorting for, say, $M = N$. Consider, for example, a Parallel Random Access Machine
comprising $N$ processors and $N$ bit-memories. With some thought, one can see that a
("smart") optimal sorting network can interconnect the $N$ processors and memories so that
for any permutation $\pi$ stored among the processors, the processors can read in parallel, with
processor $k$ reading the bit in memory $\pi(k)$, where $\pi(k)$ is known only to processor $k$. What
is remarkable about this construction is that the memories (provably) cannot "know" which
of the processors requested their own data, for the information flow would be too high.
Nevertheless, the data can arrive at each of the appropriate processors. Similarly, consider
the ranking problem, where $N$ processors are each given one number in $[0,N]$; each processor
can learn the rank of its own number despite the fact that the overall communication across
any cut line, during the total time $T$, is only $O(N)$ bits.


¹There is one difference: we need $N\log N$ area for these problems, rather than $N\log(2N/M)$.
Appendix  A.   The  Encoding

The encodings used to represent a set of $X$ numbers in the range $[0,R]$ are taken
principally from [Siegel, 1984a]. To encode $O(X)$ numbers when $R = X$, $O(X)$ bits are used.
$O(R\log(2X/R))$ bits are needed when $R \le X$, and $O(X\log(2R/X))$ bits when $X^2 > R \ge X$. For
$R > X^2$ (or, in fact, $R > X^{1+\epsilon}$), it is not possible to use an encoding that decreases the number of
bits appreciably.

When  R=X,  we  use  the  following  encoding.  We  use  two  types  of  bit-sized  markers, 
black  (0)  and  white  (1).  A  black  marker  indicates  an  increment  by  one,  while  a  white 
marker  indicates  an  instance  of  a  number;  we  represent  the  sorted  sequence  of  numbers  by  a 
sequence  of  these  markers.  For  example:  the  sequence  0,3,3,4,5,5  of  numbers  is  represented 
by  the  sequence  1,0,0,0,1,1,0,1,0,1,1  of  markers.  Each  marker  takes  one  bit;  there  are  X 
white markers and $X-1$ black markers, so $2X - 1$ bits are used.
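The marker encoding for $R = X$ can be sketched in a few lines. This is only an illustrative software model of the bit sequence, not the circuit; the function names `encode_markers` and `decode_markers` are hypothetical and do not appear in the paper.

```python
def encode_markers(sorted_nums):
    """Encode a sorted list of non-negative ints as black (0) / white (1) markers."""
    bits = []
    current = 0  # value that the next white marker would stand for
    for v in sorted_nums:
        # One black marker per unit increment needed to reach v.
        bits.extend([0] * (v - current))
        current = v
        # One white marker per instance of the current value.
        bits.append(1)
    return bits


def decode_markers(bits):
    """Recover the sorted list of numbers from the marker sequence."""
    nums, current = [], 0
    for b in bits:
        if b == 0:
            current += 1          # black marker: increment by one
        else:
            nums.append(current)  # white marker: one instance of the value
    return nums
```

Applied to the example sequence 0,3,3,4,5,5 from the text, `encode_markers` reproduces the marker sequence 1,0,0,0,1,1,0,1,0,1,1.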

When $X^2 > R \ge X$, a black marker indicates an increment by $R/X$, and uses one bit (0).
A white marker represents one number and stores the $\log(R/X)$ least significant bits of the
number. A white marker has a header bit of 1 to distinguish it from a black marker; its
length is thus $1 + \log(R/X)$. Again, the numbers are stored in sorted order using these
markers. Such a representation takes $O(X\log(2R/X))$ bits.
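A minimal sketch of this second encoding follows, under the simplifying assumption (ours, not the paper's) that $R/X$ is a power of two, so that $\log(R/X)$ is an integer. The names `encode_sparse` and `decode_sparse` are hypothetical.

```python
def encode_sparse(sorted_nums, R, X):
    """Black marker (0) = increment by R/X; white marker = 1 + log(R/X) low bits."""
    step = R // X
    low_bits = step.bit_length() - 1          # log2(step)
    assert step >= 1 and step == 1 << low_bits, "assumes R/X is a power of two"
    bits, base = [], 0
    for v in sorted_nums:
        # Emit black markers until v lies in [base, base + step).
        while base + step <= v:
            bits.append(0)
            base += step
        # White marker: header bit 1, then the low-order bits of v.
        bits.append(1)
        offset = v - base
        bits.extend((offset >> i) & 1 for i in reversed(range(low_bits)))
    return bits


def decode_sparse(bits, R, X):
    """Invert encode_sparse."""
    step = R // X
    low_bits = step.bit_length() - 1
    nums, base, pos = [], 0, 0
    while pos < len(bits):
        if bits[pos] == 0:
            base += step
            pos += 1
        else:
            offset = 0
            for b in bits[pos + 1:pos + 1 + low_bits]:
                offset = (offset << 1) | b
            nums.append(base + offset)
            pos += 1 + low_bits
    return nums
```

With $R = 16$ and $X = 4$, each white marker takes 3 bits and there are at most $X$ black markers, in line with the $O(X\log(2R/X))$ bound stated above.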

For $R \le X$, a black marker indicates an increment by one. It is convenient to use a 0 as a
header for such a marker as before, but we also reserve $\log(X/R)$ additional bits for use as
specified below. A white marker (1) is bit-sized, and represents $X/R$ instances of a value,
rather than just a single instance as before. Thus the number of white markers between two
black markers indicates how many multiples of $X/R$ instances of a specific value appear in the
data. The number of additional instances of the value is stored in the $\log(X/R)$ bits of the
black marker following that value. The resulting sequence has bit length $O(R\log(2X/R))$.
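The dense encoding can be modelled as below. The sketch assumes that $X/R$ is a power of two, that values lie in $0, \ldots, R-1$, and that every value (including the last) is followed by a black marker; these conventions and the names `encode_dense` and `decode_dense` are our own assumptions for illustration.

```python
from collections import Counter


def encode_dense(nums, R, X):
    """White marker (1) = X/R instances; black marker = 0 plus log(X/R) remainder bits."""
    group = X // R
    extra = group.bit_length() - 1            # log2(group)
    assert group >= 1 and group == 1 << extra, "assumes X/R is a power of two"
    counts = Counter(nums)
    bits = []
    for value in range(R):
        whole, rest = divmod(counts[value], group)
        bits.extend([1] * whole)              # multiples of X/R instances
        bits.append(0)                        # black marker header
        bits.extend((rest >> i) & 1 for i in reversed(range(extra)))
    return bits


def decode_dense(bits, R, X):
    """Invert encode_dense, returning the sorted list of numbers."""
    group = X // R
    extra = group.bit_length() - 1
    nums, value, pos, whites = [], 0, 0, 0
    while pos < len(bits):
        if bits[pos] == 1:
            whites += 1
            pos += 1
        else:
            rest = 0
            for b in bits[pos + 1:pos + 1 + extra]:
                rest = (rest << 1) | b
            nums.extend([value] * (whites * group + rest))
            value, whites = value + 1, 0
            pos += 1 + extra
    return nums
```

There are at most $R$ white markers when at most $X$ numbers are encoded, and exactly $R$ black markers of $1 + \log(X/R)$ bits each, which matches the stated $O(R\log(2X/R))$ length.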

When sorting, it will be convenient to store these encodings in several sequences
(packets), each of length $O(T)$. For $R = X$, this can be interpreted as dividing the numbers
into sets, with each set holding at most $T$ numbers in a range of at most $T$ values. For $R \ge X$,
the sets each hold at most $T/\log(2R/X)$ numbers in a range of at most $(RT)/(X\log(2R/X))$
values. And for $R \le X$, the sets each hold at most $(XT)/(R\log(2X/R))$ numbers in a range of at
most $T/\log(2X/R)$ values. To encode each of these sets we use $O(T)$ bits, using the
appropriate part of the encoding of the whole set, described above. With each of these sets it
will often be useful to record further information, such as the value of the smallest number in
the set, which is called the header of the set. Of course, such auxiliary information will not
increase the packet lengths by more than a constant factor.

We  also  note  that  the  input  data  need  not  have  a  balanced  distribution  of  values,  and 
consequently, some encoded packets might, in fact, be empty, despite having $O(T)$ bits. In
addition,  several  packets  might  be  required  to  account  for  all  the  numbers  present  in  a  single 
packet-range  of  values.  Nonetheless,  this  encoding  modification  increases  the  total  encoding 
length  by  only  a  constant  factor. 

For completeness, we observe that two other data representations are also used.
Sometimes it is sufficient to represent values by count, that is, to have $R$ packets of $\log X$ bits,
where the binary value in packet $k$ is simply the number of instances of the number $k$ among
the inputs. This encoding is within a constant factor of optimal when, for example, $R < X^{1/2}$.
It can also be used as a suboptimal encoding in some stages of optimal merging cascades that are
outside of any area-time bottleneck.
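The count representation is simple enough to state directly. In this sketch the function name `encode_counts` is hypothetical, values are assumed to lie in $0, \ldots, R-1$, and each packet is given `X.bit_length()` bits so that a count as large as $X$ fits; that width is our rounding of the $\log X$ figure in the text.

```python
def encode_counts(nums, R, X):
    """Return R fixed-width binary strings; packet k holds the count of value k."""
    width = max(1, X.bit_length())  # enough bits for any count in 0..X
    counts = [0] * R
    for v in nums:
        counts[v] += 1
    return [format(c, f"0{width}b") for c in counts]
```

For example, the inputs 0, 2, 2, 3 with $R = X = 4$ yield the four packets 001, 000, 010, 001.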

Finally, to sort very long number strings, i.e. where $R \gg X$, the sorting circuits first
compute the rank of each number. For this computation, it suffices to represent a substring
of each of the input values by the name of the variable and the rank of the variable's
substring, compared to the substrings of identical significance of the other variables. See
sections 7 and 8.

Appendix  B.  Merging  Schemes 

This appendix describes the merging schemes used by our dense sorters for $M = N$ and
$T = \log N\log^* N$, and describes their construction for this case.

We use a hierarchy of merging networks, indexed by the parameter $x \ge (\log^* N)^2$. A
network merges the outputs from an $\frac{x}{\log x\,(\log^* N)^2} \times \frac{x}{\log x\,(\log^* N)^2}$ array of units, each of
which outputs an encoded sorting of $N/x^2$ numbers (using $O(N\log x/x^2)$ bits). Our network
outputs a sorted merge of this data, encoded in a representation for the $\frac{N}{(\log x\,(\log^* N)^2)^2}$
input numbers, which uses $O\!\left(\frac{N\log(\log x\,(\log^* N)^2)}{(\log x\,(\log^* N)^2)^2}\right)$ bits. The network runs in time $O(\log N)$. To
accommodate our network, our array must sustain an increase in side length (i.e. the side
length exclusive of the side lengths of the units) of (at most)
$O\!\left(\frac{N}{\log N\log x\,(\log^* N)^2}\right)$. We give a description of this merging network (merging network 1)
in appendix C.

Suppose we build a nested hierarchy of these merging networks, starting with units for
which $x = \log N$. After $\log^* N - \log^*\log^* N$ levels of construction, we arrive at a merging
network with $(\log^* N)^2 \le x \le (\log^* N)^4$. A unit at the bottom of this sequence uses a circuit
switching sorter that runs in time $\log N\log^* N$ and uses area $O(N^2/(\log^2 N\log^* N)^2)$. It also
contains an encoding device that uses comparable time and area, whose description can be
found in appendix D. Each such unit inputs $N/\log^2 N$ numbers and produces output in
encoded form (using $O(N\log\log N/\log^2 N)$ bits). Each level of the merging networks adds side
length (less than) $O(N/(\log N(\log^* N)^2))$ to the full sorting network. Altogether, the merging
networks add side length $O(N/(\log N\log^* N))$. The initial sort plus the work done by the
merging networks takes time $O(\log N\log^* N)$. The sort is not completed, however.

Unfortunately, for $x < \log x\,(\log^* N)^2$ the merging networks are ineffective, in that they
do not cause further merging of the inputs. The underlying problem is that we would be
trying to run the merging network too fast; we recall that $\log N\log^* N$ time is needed to send
$N$ bits of information across a side of the sorter. So we introduce another hierarchy of
merging networks: merging network 2 (described in appendix C). It has the following
properties. It holds $const^2$ units arranged in a $const \times const$ array, where $const$ is 16. Let
each unit have $l$ output wires. The increase in side length, per row of units, due to the
merging network will be $O(l)$.

To  ensure  that  the  total  area  for  the  hierarchy  is  small,  we  increase  the  packet  length  at 
each  level  of  this  hierarchy,  to  prevent  the  number  of  packets  (i.e.  wires)  from  being 
excessive.  The  increased  packet  length,  of  course,  slows  down  the  merges.  Optimality  is 
achieved  by  balancing  this  area-time  tradeoff. 

For simplicity, suppose that the top level of our type 1 merging network yields a
$(\log^* N)^4 \times (\log^* N)^4$ array of units. We use $\log\log^* N$ stages of type 2 merging networks. Stage
$\log\log^* N$ outputs the final merge in encoded form; it has $O(N/(\log N\log^* N))$ output wires with
$O(\log N\log^* N)$ bits per wire. Stage 0 comprises the outputs of our type 1 merging hierarchy;
there are $(\log^* N)^8$ units, each with $O\!\left(\frac{N}{\log N(\log^* N)^8}\log\log^* N\right)$ output wires, and $O(\log N)$ bits
per wire.
We select a bit-per-wire count, for type 2 merging networks, that forms a geometric
sequence. (This choice, of course, fixes the wire-count of a unit, which is proportional to the
information content of the unit's input divided by the number of output bits per wire.)
Specifically, stage $k$ comprises $\frac{(\log^* N)^8}{256^k}$ units, each with
$O\!\left(\frac{N}{\log N(\log^* N)^8}\,128^k(\log\log^* N - k)\right)$ output wires, and $O(2^k\log N)$ bits per wire. Thus the
side length used by (the $\frac{(\log^* N)^4}{16^k}$ rows of units in) stage $k$ is
$O\!\left(\frac{N}{\log N(\log^* N)^4}\,8^k(\log\log^* N - k)\right)$. The stage runs in time $O(2^k\log N)$. This merging
network is described in appendix C.

We can now deduce
Lemma 3: For $M = N$, $T = \log N\log^* N$, there exist VLSI sorters with $AT^2 = \Theta(N^2)$.

Proof: The running time of the type 1 merging networks is $O(\log N\log^* N)$ in total, since each
one of them runs in time $O(\log N)$. The running time of the type 2 merging networks is also
$O(\log N\log^* N)$, the sum of a simple geometric series. The initial sorters also run in
$O(\log N\log^* N)$ time, so the total execution time is just $O(\log N\log^* N)$.

Our sorter has a side length comprising three terms. The length of the initial sorters is
proportional to $\log N \times \frac{N}{\log^2 N\log^* N} = N/(\log N\log^* N)$. The additional side length induced by
the type 1 merging networks is $O(N/(\log N\log^* N))$. The additional side lengths due to the type
2 merging networks form a (modified) geometric series with total $O(N/(\log N\log^* N))$. Hence
the total area used is $O(N^2/(\log N\log^* N)^2)$. The result follows. □
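The two geometric series used in the proof can be checked numerically. The sketch below assumes base-2 logarithms and takes the number of type 2 stages to be $\lceil\log\log^* N\rceil$; the helper names `log_star`, `type2_total_time` and `type2_side_factor` are hypothetical, and the constants hidden in the $O(\cdot)$ bounds are taken to be 1.

```python
import math


def log_star(n):
    """Iterated logarithm: number of times log2 is applied before the value is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


def type2_total_time(N):
    """Sum of the per-stage running times 2^k * log N over stages k = 0..K."""
    log_n = math.log2(N)
    stages = math.ceil(math.log2(log_star(N)))  # K = ceil(loglog* N)
    return sum((2 ** k) * log_n for k in range(stages + 1))


def type2_side_factor(K):
    """Sum of 8^k * (K - k) for k = 0..K, the stage-dependent part of the side length."""
    return sum((8 ** k) * (K - k) for k in range(K + 1))
```

Since $\sum_{k \le K} 2^k < 2\cdot 2^K$, the total time is within a constant factor of $\log N\log^* N$. Likewise $\sum_{k \le K} 8^k(K-k) < 8^K$, and $8^K = (\log^* N)^3$ when $K = \log\log^* N$, which combined with the common factor $N/(\log N(\log^* N)^4)$ gives the stated $O(N/(\log N\log^* N))$ side length.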

Appendix  C.  Aligned  Merging 

This  merging  scheme  is  used  whenever  M<N.  We  include  it  in  this  appendix  for 
completeness.  Its  description  is  the  subject  of  section  2. 

Merging  Network  1 

Let $x \ge (\log^* N)^2$ be a parameter, with $\log N \ge x$. A type 1 merging network for numbers
lying in $[0,N]$ is a box containing an $\frac{x}{\log x\,(\log^* N)^2} \times \frac{x}{\log x\,(\log^* N)^2}$ array of units, each of which
outputs an encoding for $N/x^2$ sorted inputs (using $O(N\log x/x^2)$ bits). The network merges the
outputs of the units, outputting, in encoded form, the sort of its $\frac{N}{(\log x\,(\log^* N)^2)^2}$ inputs (using
$O\!\left(\frac{N\log(\log x\,(\log^* N)^2)}{(\log x\,(\log^* N)^2)^2}\right)$ bits). The merging network runs in time $O(\log N)$. The increase in side
length of the array due to the presence of the merging network (i.e. the side length exclusive
of the side length due to the units) is

$O\!\left[\frac{N}{\log N\log x\,(\log^* N)^2}\right]$.



The merge proceeds in two phases. First, for each column, we merge the outputs of the
$x/(\log x\,(\log^* N)^2)$ units in the column; second, we merge the outputs from the first stage.
Finally, we recode the outputs into a more compact form. We describe the merge in the first
phase; the merge in the second phase is similar. The recoder is described in appendix D.

The outputs from each of the units are produced in packets of $O(\log N)$ bits per wire;
each packet is preceded by a $(\log N)$-bit header that equals the value of the smallest number
stored in the packet. The encoding satisfies the following local uniformity criterion: a packet
encoding sorted numbers from a set of $N/x^2$ inputs represents at most $\log N/\log x$ numbers in a
range of $x^2\log N/\log x$. Thus we use $N\log x/(x^2\log N)$ output wires from each unit in the array.
The output packets from each unit in the column are fed into a circuit switching sorter having
$N/(x\log N(\log^* N)^2)$ input wires, i.e. one input wire per packet. The packets are sorted by
header in time $O(\log N)$. For the moment we assume that for each input value, no unit in the
column  has  more  than  2  packets  containing  the  value.  (We  will  soon  show  how  to  drop  the 
assumption.)  The  local  uniformity  criterion  guarantees  that  each  packet  of  inputs  has  a  range 
less  than  .Slog^N;  it  can  overlap  with  the  values  in  at  most  log*N  packets.  Thus  a  'near  sort' 
has  been  achieved,  and  a  little  tidying  up  will  complete  the  merge. 

We now sort groups of $2\log^* N$ consecutive packets. It suffices to decode each packet
into $O(\log N)$ numbers; we use a circuit switching sorter running in time $O(\log N)$ to sort the
numbers, and then reencode them. (Each of these sorters uses area $O(\log^{10} N)$; it is easy to
lay them out in the available $N^2/[x\log N(\log^* N)^2]^2$ area.) This process is performed twice; first
we sort groups consisting of the packets in positions $2k\log^* N + i$, $1 \le i \le 2\log^* N$, $k = 1, 2, \ldots$;
then we sort groups consisting of the packets in positions $(2k+1)\log^* N + i$, $1 \le i \le 2\log^* N$,
$k = 1, 2, \ldots$. At this point the numbers are in sorted order (in encoded form).
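The two-pass tidying step can be illustrated on plain numbers. The sketch below is a simplification of our own: it works on individual items rather than encoded packets, uses 0-based positions, and assumes every item starts at most `g` positions from its sorted position (with `g` playing the role of $\log^* N$). The name `two_pass_group_sort` is hypothetical.

```python
def two_pass_group_sort(items, g):
    """Complete a near sort: each item is assumed to be within g places of its final place."""
    out = list(items)
    width = 2 * g
    # Pass 1 sorts groups starting at 0, 2g, 4g, ...; pass 2 sorts groups
    # starting at g, 3g, 5g, ..., which straddle the pass-1 group boundaries.
    for offset in (0, g):
        for start in range(offset, len(out), width):
            out[start:start + width] = sorted(out[start:start + width])
    return out
```

The first pass moves every item that belongs in a neighbouring group to the adjacent edge of its current group, and the second pass, whose groups are offset by half a group width, exchanges those items across each boundary.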

There  are  many  ways  to  remove  the  assumption  that  for  each  value,  no  unit  has  more 
than  2  packets  containing  the  value.  For  example,  since  each  level  of  the  merging  structure  is 
a  perfect  sort,  it  suffices  to  mark  those  packets  that  contain  just  one  value  and  that  have,  in 
the same unit, both preceding and succeeding packets that also contain the value. The
unmarked  packets  are  then  merged  as  described  above.  (The  steps  needed  to  partition 
marked  and  unmarked  packets  and  to  route  them  to  specific  input  ports  of  a  sorter  are  based 
on  sorting  and  can  be  found  in  section  2.) 
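The marking rule can be written out as follows. This is an illustrative model in which a unit's output is a list of packets and each packet is a sorted list of decoded values; the name `mark_packets` is hypothetical.

```python
def mark_packets(unit_packets):
    """Return one boolean per packet: True if the packet is marked.

    A packet is marked when it holds just one distinct value and both its
    predecessor and its successor in the same unit also contain that value.
    """
    marks = []
    last = len(unit_packets) - 1
    for i, packet in enumerate(unit_packets):
        single_valued = len(set(packet)) == 1
        if single_valued and 0 < i < last:
            value = packet[0]
            marks.append(value in unit_packets[i - 1] and value in unit_packets[i + 1])
        else:
            marks.append(False)
    return marks
```

After marking, any value held by the unit appears in at most two unmarked packets of that unit (the first and the last packet of its run), which is the assumption the merge above relies on.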

Next,  another  circuit  switching  sorter  is  used  to  merge  by  header  the  sorted  unmarked 
packets  with  the  marked  packets.  Once  sorted,  strings  of  consecutive  marked  packets  will 
have  values  in  common  only  with  their  immediate  unmarked  predecessor  and  successor 
packets. It is a simple matter to use a binary tree to distribute, in $O(\log N)$ time, the
unmarked  predecessor  packet  to  the  appropriate  marked  packets.  Each  marked  packet  will 
absorb  the  data  that  is  no  less  than  its  header  and  is  less  than  its  successor's  header,  and  the 
unmarked  packet  is  truncated  appropriately.  Thus  unmarked  packets  will,  in  general,  be 
dissected  and  distributed  among  the  marked  packets,  which  may  then  acquire  new  values,  and 
which  may  double  in  length.  Further  sorting  operations  are  used  to  recode  the  packets,  since 
some  may  now  be  too  long,  and  some  may  be  too  short.  These  recoding  steps  are  addressed 
in  appendix  D. 

Merging  Network  2 

Let $k$ be a parameter, with $1 \le k \le \log\log^* N$, and let $j = k - 1$. A stage $k$ type 2 merging
network for numbers lying in $[0,N]$ is a box containing a 16 x 16 array of units, each of which
has $\frac{N}{\log N(\log^* N)^8}\,128^j(\log\log^* N - j) = l$ output wires. Each wire carries a packet of length
$O(2^j\log N) = b$ bits. The network merges the outputs of the 256 units, putting them into
sorted encoded form, in time $O(\log N + b)$. The increase in side length due to the merging
network is at most $O(l)$.

We  use  an  8  phase  merge.  In  each  phase  we  merge  the  outputs  of  2  mergers  from  the 
previous  phase  (in  the  first  phase  we  merge  the  outputs  of  two  units).  At  the  end  of  the 
merging  phases  we  recode  the  packets,  and  redistribute  the  outputs  so  that  there  are  2b  bits 
per  wire,  as  described  in  appendix  D.  Each  phase  uses  a  method  similar  to  that  used  by 
merging  network  1.  The  difference  is  that  going  from  a  'near  sort'  to  a  complete  sort  is 
simpler. We first sort all the packets by header. Let $S_1$ and $S_2$ be the sets of packets we are
merging. If a packet $P_1$ from $S_1$ ($S_2$, respectively) has a packet in $S_2$ ($S_1$) as its immediate
successor, then $P_1$ will be truncated and distributed to the string of consecutive packets from
$S_2$ ($S_1$) that immediately follow it. It suffices to use a binary tree to instantiate the
distribution. The data management details are the same as for a type 1 network. Finally, we
note that packets contain $O(b)$ bits, so the sorting and recoding operations take $O(\log N + b)$
time.

Merging  Network  3 

Let  X  be  a  parameter.  A  type  3  merging  network  for  numbers  lying  in  [0,N]  is  a  box 
containing  a  4x4  array  of  units,  each  of  which  output  an  encoding  for  A^/(64j:-')  sorted 
inputs.  The  merging  network  accumulates  4  waves  of  outputs  from  these  units,  and  produces 
an  encoded  representation  for  the  sort  of  these  N/x"^  input  values.  The  outputs  of  each  of  the 
16  units  arrive  on  N^^/{S>x^^)  wires;  the  outputs  of  the  merging  network  leave  on  N^'^/x^^ 
wires.  The  increase  in  side  length  due  to  the  merging  network  is  OiN^^/x^'^).  The  running 
time  of  the  network  is  the  maximum  of  0{N^^\ogx/x^^)  and  the  time  for  the  four  waves  of 
inputs  to  be  produced  by  the  units. 

The  merge  has  two  stages:  first,  for  each  of  the  units,  its  four  waves  of  output  are 
merged  in  three  merging  operations;  then,  these  16  sets  are  merged  together  pairwise  as  for  a 
type  2  network.  Finally,  the  outputs  are  distributed  over  the  correct  number  of  output  wires. 
It  is  easy  to  see  that  the  network  has  the  required  size  and  running  time. 

Merging  Network  4 

A type 4 network merges numbers lying in $[0,M]$, where $M < N$. It is parametrized by $x$,
where $x \ge M$. The network is a box containing an 8 x 8 array of units, each of which outputs
an $O(M\log(2x/M))$-bit encoding representing the sort of $x$ inputs. The merging network
produces an $O(M\log(128x/M))$-bit encoding representing the sort of its $64x$ inputs. Let
p  =  l/61og(2j:/Af).  The  merging  network  produces  its  output  on  M-Ai'*^/T  wires,  in  time 
0(max[r-(l/2y *',logx]).  The  increase  in  side  length  due  to  the  merging  network  is 
o{m  ■^^'^/T).  We  note  that  for  x2s64Af ,  the  bit  length  of  the  output  of  the  merging  network 
is  at  most  twice  the  bit  length  of  the  output  of  one  unit. 

The  merging  is  again  performed  pairwise  as  for  a  type  2  network.  In  this  case, 
however,  the  recoding  is  performed  by  the  second  type  of  recoder  described  in  appendix  D, 
since  M<N . 

Merging  Network  5 

A  type  5  network  merges  numbers  lying  in  [0,A/],  where  M<N.  It  is  parametrized  by  x, 
where  xS:  (log'M)^  be  a  parameter,  and  2'^'-^°^'^^^T\og'M .    The  network  is  a  box  containing 

an  (2*'('°8'^Vx)  X  (2^('°8'^Vx)  array  of  units,  each  of  which  outputs  on  — — —  wires  an 

Tlog'M 
0(Af  logx)-bit  encoding  representing  the  sort  of  Mx^  inputs.    The  merging  network  outputs  its 

^4x/(iog-AO^  inputs  in  encoded  form  (using  C>(A/x/(log*A/)2  bits).    It  runs  in  time  0(T/log'M), 

M2'^t'°8''^^ 
where   logAf  log  A/^Jslog^A/,   and   has   output  wires.     The   increase   in  side 

Tlog'M 


length  due  to  the  merging  network  is  O 

Tlog'M 

The  merging  is  accomplished  in  two  phases  as  for  the  type  1  merging  network.  We 
appear  to  need  aligned  merges  for  the  first  merging  phase.  The  outputs  from  each  column  of 
units  are  combined  by  an  aligned  merge,  that  is,  for  each  column,  M x2^^^°^'^^  numbers  are 

merged  in  an  encoded  form  that  uses  0(Afx/(log*A/)^)  bits.    Thus  there  are  — ^^ —  channels 

riog'Af 
per  column.    The  increase  in  side  length  due  to  the  aligned  merging  networks  for  all  the 

Mx         2^('°«*'^^  yV/2^('°«''^>^ 

columns  is  thus =  .    We  remark  that  there  is  one  difficulty 

Tlog'M  X  Tlog'M 

encountered  in  the  aligned  merge  of  section  2  that  can  be  avoided  here.    The  difficulty  was 
that the set of packets belonging to a single breakpoint range might contain too many items,
so that several channels might be required to merge the numbers in the range. For the
current  aligned  merge,  a  packet  output  by  a  unit  will  have  an  encoding  that  enables  it  to 

contain  up  to  2'=^^'°8''*^'-7"log*Af  numbers  in  a  range  of  at  most  — ^ values,  since  that 

encoding  is  optimal  for  the  number  of  inputs  comprised  by  a  column.    An  entire  column, 

therefore    can  have  at  most Tlog'M  numbers  belonging  to  a  single  breakpoint 

X 

range.  Thus  it  suffices  to  use  an  encoding  for  the  units  that  is  optimal  for  a  total  of 
3/4x/(iog"AO'  numbers  in  a  range  of  M  values,  since  this  encoding  uses  only  twice  as  many  bits 
as  the  encoding  we  would  otherwise  have  used.  Consequently  the  aligned  merge  will  give  a 
perfect  sort  rather  than  just  a  near  sort. 

The  second  phase  proceeds  in  a  similar  way  to  the  phases  in  merging  network  1.  That 
is,  a  circuit  switched  sorter  is  used  to  sort  the  packets  by  header,  in  time  logAf  +  T/log'M .  To 
complete  the  sort,  we  mark  packets  that  contain  just  one  value  and  that  have  both  preceding 
and  succeeding  packets  that  also  contain  the  value.  The  unmarked  packets  are  then 
processed.  Now  an  unmarked  packet  can  overlap  at  most 

2  Z^^'""'"^' .  rioR-Af  ^  fzrioR'M  1   ^2\og^M    other    packets.     We    sort   groups    of    41og^M 

X  X  y       X       ) 

contiguous  packets,  as  for  merging  network  1.  To  carry  out  this  sort,  we  expand  each 
marker  in  each  packet  into  a  value  and  a  count,  these  being  the  value  and  count  represented 
by  the  marker.  There  are  ^T/log*M  ^^  {\og^M)/2  markers  per  packet,  or  a  total  of  at  most 
21og*A/  markers  per  group,  and  each  count  represents  at  most  M^  numbers.  Thus  each 
(expanded)  marker  uses  0(logAf )  bits.  We  sort  the  markers  with  a  circuit  switched  sorter 
that  takes  0(log'^A/)  area  and  runs  in  time  logM .  A  binary  tree  is  used  to  count  how  many 
numbers  of  each  value  are  present.  The  results  are  then  recoded  into  new  sorted  packets,  in 
a  further  logM^T/log'M  time.  The  marked  packets  are  handled  analogously  to  those  for  a 
type  1  merging  network. 

Merging  Network  6 

A  type  6  merging  network  is  a  unit  that  takes  its  inputs  from  two  units  to  its  left.  One 
wave  of  output  from  this  network  comprises  the  merge  of  two  waves  of  outputs  from  the  two 
units.  Thus  in  a  binary  tree  of  type  6  merging  networks,  the  running  time  per  wave  (i.e.  the 
period)  doubles  at  successive  levels  of  the  tree.  The  packet  length  also  doubles,  and  the  wire 
count  is  the  number  of  packets  needed  to  carry  the  encoded  data.  (The  wire  count,  therefore, 
roughly doubles or halves, depending on whether $M \ge N$ or not.)

The  network  accumulates  in  a  buffer  its  four  sets  of  inputs  (as  packets),  and  then  sorts 
and  merges  them  as  does  a  type  2  merging  network.  The  running  time  is  proportional  to  the 
length  of  the  output  packets.  The  network  uses  circuit  switching  sorters  with  dimensions 
proportional  to  the  wire  count.  Thus  the  width  is  proportional  to  the  wire  count,  while  the 
length  is  proportional  to  the  packet  length,  to  accommodate  the  necessary  storage. 

Appendix  D.  Recoders 

First, we consider the problem of recoding $l$ packets containing altogether $P$ numbers in
a range of $Q$ values, $Q \ge P$, where the numbers are currently coded in too sparse a form, that
is, as if $P$ were smaller than it actually is. So suppose we are given the numbers encoded in
$b$-bit packets on $l$ wires. We show how to recode them using a circuit of size $O(l)$ by
$O(l + b)$, in time $O(\log l + b)$. The outputs appear on $k$ wires in packets of $b$ bits each, $k \le l$.
(The choice of $b$ implicitly defines $k$; it is an easy matter to increase or decrease the packet
length; just, respectively, reduce or increase the number of packets.)
We introduce the $k/2$ dummy values $i(2Q)/k$, $i = 1, 2, \ldots$, and use a circuit switched
sorter to merge them with the $l$ packets that are the inputs. We use the header in each packet
as its key. If there is a tie between a header and a dummy, the header is defined to be
larger. This sort will take time $O(\log l + b)$, on a sorter using area $O(l(l + b))$ (the '$b$' term
allows for the case that $b \ge l$).

We note that the range of each input packet is at most $2Q/k$ (in fact it is $O(Q/l)$), except
in  the  case  that  the  numbers  are  not  encoded  at  all  (we  handle  this  case  below).  For  each 
dummy  there  is  just  one  packet  whose  values  may  both  precede  and  follow  it:  the  packet,  if 
any,  immediately  preceding  it  in  the  sorted  order  just  obtained.  If  there  is  such  a  packet  we 
partition  it  into  two  packets:  those  numbers  preceding  the  new  value,  and  the  remainder. 
This  remainder  is  stored  with  the  dummy,  which  becomes  its  header.  Next,  we  recode  the 
items  in  each  packet:  that  is,  we  introduce  more  black  markers,  and  for  each  white  marker 
we  reduce  the  number  of  bits  stored.  This  can  be  done  in  a  single  sweep  over  each  packet,  in 
time  0(b). 
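The splitting of packets at the dummy values can be sketched as follows. This models only the bookkeeping, not the circuit or the subsequent marker recoding; packets are taken to be sorted lists of decoded numbers whose first element serves as the header, and the name `split_at_dummies` is hypothetical.

```python
def split_at_dummies(packets, Q, k):
    """Merge packets with the k/2 dummy values i*(2Q)/k and split straddling packets.

    Returns a list of (header, items) pairs in sorted order.
    """
    step = (2 * Q) // k
    dummies = [i * step for i in range(1, k // 2 + 1)]
    # Kind 0 = dummy, kind 1 = packet: on a tie the dummy sorts first,
    # i.e. the packet header is treated as the larger key.
    entries = [(d, 0, None) for d in dummies]
    entries += [(p[0], 1, p) for p in packets if p]
    entries.sort(key=lambda e: (e[0], e[1]))

    result = []
    prev_was_packet = False
    for key, kind, items in entries:
        if kind == 1:
            result.append((key, list(items)))
            prev_was_packet = True
        else:
            rest = []
            if prev_was_packet:
                # Only the packet immediately preceding the dummy can straddle it.
                header, prev_items = result[-1]
                rest = [v for v in prev_items if v >= key]
                result[-1] = (header, [v for v in prev_items if v < key])
            result.append((key, rest))
            prev_was_packet = False
    return result
```

With $Q = 16$ and $k = 4$ the dummies are 8 and 16, so a packet holding 6, 9, 11 is cut into a packet with header 6 holding 6 and a packet with header 8 holding 9 and 11.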

We handle the case that the numbers are not initially encoded as follows. As before,
we use a circuit switched sorter running in time $O(b)$ to sort the dummies together with the
input packets (the key for a packet is the first number in the packet). This sorter uses area
$O(l(l + b))$. We then distribute copies of each packet to every dummy with which it overlaps;
for each dummy $D$ we remove items from the associated packet so that the remaining items
all lie between the value of $D$ and the value of the next dummy. (This can easily be done
with the help of a tree of height $\log l$.)

Next,  the  numbers  in  each  packet  are  recoded  by  a  sweep  across  the  packet;  this  entails 
the  introduction  of  more  black  markers,  and  a  reduction  in  the  number  of  bits  stored  per 
white marker. It will take time O(b). There are still l, rather than k, packets, however. We
need to choose new headers: they will be the k/2 dummy values we introduced, and every
l/(2k)th current header. For each packet, its new header is the nearest preceding header in
the sorted order. We copy each newly encoded packet to its new header. There are k new
headers, each of which has at most b bits copied to it. This copying takes O(b) time. The
problem of routing the new packets to a designated set of k output wires is solved by further
sorting  as  in  section  2.    Similarly,  a  tree  can  be  used  to  find  more  precise  breakpoints. 

Second, we consider the problem of recoding P numbers in the range Q, Q ≤ P, to a
denser encoding (the initial encoding supposes there are at least Q numbers present).
Suppose the numbers arrive on v wires, in packets of length b, and they are to depart,
recoded, on w wires, in packets of length b, where v ≥ w. We achieve this in time
O(log v + b), using a circuit taking area O(v(v + b)).

We  introduce  dummy  headers:  every  2Q/wth  value,  and  every  2v/wth  current  header. 
We  note  that  the  set  of  numbers  between  two  adjacent  dummy  headers  can  be  encoded  using 
b  bits.  Using  a  circuit  switched  sorter,  we  sort  the  dummy  headers  and  the  packets;  the  key 
for  a  packet  is  its  header.  We  break  ties  by  defining  the  dummy  headers  to  be  larger.  We 
note  that  the  only  packet  that  may  overlap  with  a  dummy  header  is  the  one  immediately 
preceding  it  in  the  sorted  order.  We  divide  such  packets  into  two  new  packets:  the  values 
preceding  the  dummy  header,  and  the  remainder  (which  take  the  dummy  header  as  their 
header). This sorter runs in time O(log v + b), and uses area O(v^2).

Suppose  the  current  encoding  uses  counts  of  at  most  x  bits;  we  are  switching  to  an 
encoding allowing counts of up to y bits, y ≥ x. We start by performing a sweep over each
packet to change the counts to the new form, performing permissible additions; this takes
time O(b). Then, for those sequences of contiguous packets that hold just one count for the
same value, we combine the counts using a tree, in time O(log v); this tree takes area
O(v log v) (of course, we do not allow the new counts to exceed y bits). Finally, by
concatenating the old packets, we form new packets in a further O(b) time.

Our third recoding construction is used in section 6. (Similar recodings are used in
section 5; the details are left to the interested reader.) The input consists of M numbers in the
range 0 to M, stored in (M/log M)^{1/2} packets, of (M log M)^{1/2} bits each. We wish to recode
these packets to a coding suitable for N numbers in this range (N ≥ M), where the numbers
are to be held in (M/log M)^{1/2} log(2N/M) packets, each of (M log M)^{1/2} bits. The available
network has size (M log M)^{1/2} by (M/log M)^{1/2} log(2N/M). The time available for the recoding is
(M log M)^{1/2}. For each of our original set of packets we create log(2N/M) new packets; if the
original packet had header h, then the new packets have headers with values
h + i(M log M)^{1/2}/log(2N/M), for 0 ≤ i ≤ log(2N/M) - 1. Each new packet is to take those
numbers from the old packet whose values lie in the range between the new packet's header
and the next larger new header. This is easily done in (M log M)^{1/2} time, by first duplicating
the original packet log(2N/M) times, and then for each instance of the packet, eliminating all
but the appropriate range by a sweep over the packet. Finally, we expand the remaining part
of each packet into the new encoding; this takes time proportional to the length of the output,
which is at most (M log M)^{1/2}.
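
The header arithmetic for one old packet can be sketched as follows. This is only an illustration under stated assumptions: `span` stands for the value range covered by one old packet (about (M log M)^{1/2}), `g` stands for the number of new packets per old packet (log(2N/M)), integer division is used for the header spacing, and the upper bound of the last new packet is taken to be `h + span`. The name `split_packet` and these parameters are hypothetical.

```python
def split_packet(h, values, span, g):
    """Split one old packet with header h into g new packets.

    New header i has value h + i*span/g (rounded down); each new packet
    keeps the values lying between its header and the next larger header.
    """
    headers = [h + (i * span) // g for i in range(g)]
    # Upper bound of each range: the next new header, or h + span for the last.
    bounds = headers[1:] + [h + span]
    return [
        (lo, [v for v in values if lo <= v < hi])
        for lo, hi in zip(headers, bounds)
    ]
```

Each list comprehension plays the role of one sweep over a duplicated copy of the packet; expanding the survivors into the new encoding is not modelled.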

Appendix  E.  Column  Counters 

Column counters are used to combine the counts of values present among blocks of a
perimeter sorter in the case M ≪ N.

We  begin  our  description  of  these  merging  networks  by  presenting  a  solution  to  the 
following problem. Given N/T input ports lying on a vertical line, count, in time O(T), how
many ones are input to all the ports (i.e. add the bits) over time O(T), using a circuit of area
O(N/T), where T^2 ≥ N. We are allowed to separate the input ports as we wish, subject to the
condition that the total area (and hence length) is O(N/T).

This  problem  would  be  quite  simple,  were  there  sufficient  width  to  use  a  binary  tree  of 
adders.  The  solution  is  to  use  an  essentially  one  dimensional  pipelined  implementation  of 
such  a  tree.  As  a  consequence  of  the  severe  data  flow  limitations,  the  one  dimensional 
version  is  obliged  to  implement  a  "poor"  binary  tree  organization  where  a  node  always  adds 
bits  of  a  fixed  value,  and  more  significant  bits  are  passed  up  the  tree.  Then  a  flushing  phase 
is  used  to  accumulate  the  data  distributed  among  the  tree  nodes. 

We use a vertical column of 2N/T - 1 units of width O(1), connected by two shift
registers, each of width O(1). One shift register transmits data upward, and one downward.
The units are numbered consecutively from 0 to N/T - 2. Box number k, where
k = (2r + 1)2^{i-1}, is said to be at level i, so that every second unit is a level 1 unit and every
fourth unit is a level 2 unit, etc. Each level 1 unit is connected to an input port. Box k can
read and write a bit on both the upward and downward k'th shift register units (the reading
and writing occur in adjacent cells of the shift register, the read cell preceding the write cell,
with respect to the shift direction). We think of the units as being organized in a tree: that is,
the two children of a level i unit, i > 1, are the two nearest level i-1 units. Each unit knows
which way its parent is (up or down), and the root is distinguished. The lengths of the units,
however, are not uniform: a level i unit has length O(i). Since there are N/T level 1 units, the
total length of the units is O(N/T).

A level i unit counts sets of 4^{i-1} inputs of ones; it has 3 storage bits for this count (i.e.
it can count up to 7 sets). Each level i unit has a clock using O(i) bits that counts up to 2^{i+1}.
All clocks start at zero at the same time. We remark that it is not necessary to start all the
clocks at the same time, although the solution to this problem is well known. It suffices to
ensure that their zeros are synchronized. This can be accomplished with an extra shift
register, which takes 4 time steps to get from one unit to the next. Initially, a flag bit travels
down the register, starting from the top unit (a level 1 unit). A level i unit is passed at time
r·2^{i+1}, for some r, which is an appropriate time to set that unit's clock to zero. Non-zero
inputs are allowed to occur only after the bit has traversed the whole shift register.

Each time its clock comes back to zero, a unit must send a count bit to its parent: a zero
if it has not accumulated 4 sets of 4^{i-1} ones, and a one otherwise; in the latter case, its local
count of sets accumulated is reduced by 4. A level 1 unit takes its input at every time step
from its input port. For i > 1, a level i unit, each time its clock is at (2^{i-1} - 1) mod 2^i, reads
its inputs (one from each shift register) and adds these values (each a zero, or a one) to its
local set count. At most four sets are counted between a unit's transmissions to its parent, so
three bits are sufficient for the local counter. We need to show that no count bit is ever
overwritten before it is read, and that the bits are read by the proper units. It suffices to
note that bits not yet read pass the output terminal of a level i unit only at times equal to
2^i mod 2^{i+1}, and so never conflict with the bits output by the level i unit (which are output at
times equal to 0 mod 2^{i+1}).
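
The counting discipline can be checked with a small software model. The sketch below is a simplification under explicit assumptions: it ignores the shift registers and their propagation delays (a carry reaches the parent in the same timestep), it arranges the units as an ordinary complete binary tree with the input ports at level 1, and it lets the root accumulate without bound. A level i unit holds a count of sets of 4^{i-1} ones and, every 2^{i+1} steps, sends a one to its parent if it holds at least 4 sets. The function name and return values are illustrative.

```python
def simulate_column_counter(inputs, levels):
    """inputs[t][p] is the bit arriving at level-1 unit p at timestep t+1.

    Returns (total, max_local): the number of ones recovered from the
    distributed counts, and the largest count held by any non-root unit
    at one of its transmission times.
    """
    n_leaves = 2 ** (levels - 1)
    # counts[i][j] is the local set count of unit j at level i+1.
    counts = [[0] * (n_leaves >> i) for i in range(levels)]
    max_local = 0
    for t, bits in enumerate(inputs, start=1):
        for p, bit in enumerate(bits):
            counts[0][p] += bit
        # Lower levels transmit first, so a parent sees the carry before
        # its own transmission in the same timestep.
        for i in range(levels - 1):          # the root never transmits
            if t % 2 ** (i + 2) != 0:        # clock period of level i+1
                continue
            for j, c in enumerate(counts[i]):
                max_local = max(max_local, c)
                if c >= 4:
                    counts[i][j] = c - 4
                    counts[i + 1][j // 2] += 1
    # Flushing phase: a set held at level i+1 stands for 4**i ones.
    total = sum(c * 4 ** i for i, row in enumerate(counts) for c in row)
    return total, max_local
```

In this model each non-root unit receives at most four increments between transmissions and retains at most three sets afterwards, so `max_local` stays within the 3-bit capacity; the weighted sum at the end corresponds to the flushing phase.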

At  the  conclusion  of  the  input  phase  (time  T),  the  count  of  the  number  of  ones  input 
will  be  accumulated  in  a  distributed  manner.  It  is  a  simple  matter  to  sum  the  partial  counts 
to  obtain  a  single  number.  The  root  can  synchronize  the  leaves  (level  1  units)  to  pass  their 
counts  up  the  tree.  It  is  not  difficult  to  have  the  counts  passed  least  significant  bit  first,  and  to 
adjust  for  the  missing  lesser  significant  bits  among  the  higher  level  units.  The  addition  is 
done systolically. The time for this phase is O(N/T) = O(T).

We now generalize the problem. In each case, the circuit is to have O(1) width.

We are given N/kT input ports lying on a vertical line. Inputs are k-bit binary numbers,
occurring at one bit per timestep. We wish to add the values input. The solution is
essentially the same: the circuit runs in time O(T), and has length O(N). The shift registers
have k bits per unit rather than one, and the units have (k + 3)-bit count capacity. Time is k
times slower; local clocks have log k more bits. Addition is ripple (systolic).

Our column counter problem has N/krT input ports lying on a vertical line. Inputs are
k-bit binary numbers, occurring at one bit per timestep, but now the k-bit counts at each port
are multiplexed as an r vector, so that the bits input at the ports during timesteps
[rks + lk + 1, rks + lk + k] comprise N/krT k-bit terms for the l'th coordinate. The problem is
to sum the N/(kr)^2 vectors using area O(N/T), where T^2 ≥ N. We use essentially the solution
given above, except that each storage unit is replicated (vertically) for a total of r stages.
Thus each unit has r subunits, each of which keeps a count, but it suffices to use just one
clock per unit. The details are straightforward and are omitted.

Appendix  F.  Terrace  Trees 

A terrace tree of depth T is an N-leaf binary tree subject to the following conditions.

(1)  The  layout  area  is  minimum  (up  to  a  constant  factor). 

(2)  The  path  length  from  any  leaf  to  the  root  is  at  most  T. 

(3)  The  leaves  all  lie  on  a  straight  line. 

The parameters h (layout height) and d (degree) will be important in our construction; they
satisfy dh = T and d^h = N; that is, h = O(log N / log(2T/log N)).

We construct an N-leaf d-degree terrace tree as follows. First, a complete N-leaf d-ary
tree is laid out, with each node of a given level at the same height in the layout. We now
replace each d-ary node by a chain of d-1 binary nodes at the same layout height above the
leaves; the last node on the chain receives two descending edges from the d-ary node, the
other nodes receive one descending edge each, and the first node receives the ascending
edge. We note that the path length from any leaf to the root is at most dh, as required.
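
The path-length bound can be checked numerically. The sketch below assumes the chain replacement works as just described: child c (counting from 0) of a d-ary node hangs off chain node min(c, d-2), and the ascending edge leaves from chain node 0, so climbing one level costs one edge into the chain plus min(c, d-2) edges along it. The function name `terrace_leaf_depths` is illustrative, and the enumeration is exponential in h, so it is meant only for small examples.

```python
from itertools import product

def terrace_leaf_depths(d, h):
    """Leaf-to-root path lengths (in edges) of the terrace tree built
    from a complete d-ary tree of height h by chain replacement."""
    depths = []
    # Each leaf is identified by its child index at every level.
    for path in product(range(d), repeat=h):
        depths.append(sum(1 + min(c, d - 2) for c in path))
    return depths
```

For d = 4 and h = 3 this yields 64 leaves with maximum depth 9, which is within the bound dh = 12.
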

The second minimum area problem we consider is how to lay out an N-node terrace tree
C that emulates a complete binary tree B of N nodes; the requirements for the emulator
graph C are formalized as follows.

Let B = (V, E), and C = (W, F). For v ∈ V, let P(v) be the parent of v. Let m: V → W be the 1
to 1 map associating nodes of C with those of B. We say that C is a terrace tree emulator of
B  if 

(1)  C  is  a  terrace  tree. 

(2)  Each  leaf  of  B  is  mapped  by  m  into  a  node  of  C  that  is  a  leaf  or  that  has  only  one 
child. 

(3) For v ∈ V, the length of the path from m(v) to m(P(v)) in C is at most d.

(4) If m(v) and m(P(v)) are at the same layout height in C, then the interior nodes on
the path from m(v) to m(P(v)) in C represent (under m) descendants of v in B.

We  use  the  terrace  tree  described  above  to  construct  C.  It  remains  to  associate  nodes  of 
the binary tree with nodes of the terrace tree. Without loss of generality let d be a power of
two (otherwise round down). We associate the top d-1 nodes of the binary tree with the
nodes at the top layout height (i.e. the top chain of d-1 vertices) of the terrace tree. The
nodes are arranged in the chain according to an inorder traversal of the (d-1)-node subtree
of B. Consider the d subtrees of B resulting from deleting the top d-1 nodes in B. These
subtrees are recursively associated with d subtrees in C, namely the subtrees at the ends of
the edges descending from the chain of d-1 vertices at the top of C. The subtrees of B are
associated with the subtrees of C according to a simultaneous inorder traversal of B and C.
It  is  clear  that  the  path  length,  in  the  terrace  tree,  from  any  node  to  its  parent  (with  respect 

to B) is at most - — =- + 2 ≤ d.

A  grounded  terrace  tree  G  is,  with  a  few  modifications,  a  terrace  tree  emulator  of  a 
complete binary tree. The modifications are as follows. The tree G has two kinds of nodes,
type S and type P. Leaf nodes are type P. Internal nodes are type S. Each type S node w
has, in addition to its two original descendants, one other descendant, a type P node located
along  the  line  of  leaves  at  the  point  of  intersection  with  the  line  that  is  perpendicular  to  the 
leaf-line  and  that  goes  through  w. 